ABSTRACT

This feasibility study for developing a measure of 
effectiveness in reading contains five sections. "The Need and 
Requirements for a Measure of Effectiveness in Reading" presents the 
problem, functional specifications for a measure of effectiveness in 
reading, the minimum number of tasks required to build an
effectiveness measure, approaches to the measurement of effectiveness
in reading, the Riverside Research Institute (RRI) approach toward
the development of a measure of effectiveness in reading, and the RRI 
approach and the minimum work tasks for developing an effectiveness 
measure in reading. "Measuring Word Familiarity" discusses scales of 
word frequency, word frequencies and the lognormal distribution, 
construction of a word familiarity scale, and a familiarity-based 
vocabulary measure. "Measuring the Readability of English Texts"
discusses the problem, a new readability formula, and construction of 
the RRI readability formula. "Implementing the Design Concepts in the 
Construction of Reading Tests" presents a plan for the construction 
of nonbiased tests and for computer-assisted tests. "Application of 
the Design Concepts for Quantifying English Text in Setting and
Monitoring Standards" discusses input data for setting standards and 
analysis of effective data. (NR) 
Chapter I.

The Need and Requirements for a Measure of 
Effectiveness in Reading 

A. The Problem 

New York State spends at least a billion dollars a year 
on reading related instruction. This sum is appropriated 
in the belief that expenditures of this magnitude are required
to provide quality education in reading to all students in the 
State. Quality education is currently defined primarily in 
terms of input factors, e.g., high quality is associated with 
high per capita costs, elaborate facilities, large numbers of
personnel, and so on. In New York State, as elsewhere, the 
outcomes of instruction, i.e., whether and how well students 
are, in fact, learning to read, are not emphasized in definitions 



New York State spends approximately six billion dollars
annually for elementary and secondary education (SED, 1973,
p. 9). On the hypothesis that this six billion is divided
equally over grades K-12, $461.5 million is spent per grade.
In the following table we have assumed that the proportion of
instructional time devoted to reading related activities varies
according to grade. Documentation exists to support the assumption
of 31% for grades 1-3 (0E0# B005114).



                    % Time for Reading
Grade               Related Instruction     Costs in Millions

K                           20                  $   92.3
1,2,3                       31                  $  429.2
4,5,6                       20                  $  276.9
7,8,9,10,11,12              15                  $  415.4

Total                                           $1,213.8
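
The arithmetic behind the table can be checked with a short sketch. It assumes, as the text does, that the $6 billion is divided equally over the 13 grades K-12. The names used (`GRADE_BANDS`, `reading_costs`) are illustrative and are not part of the original study.

```python
# Reproduce the estimate of annual reading-related instruction costs.
# Assumption (from the text): $6 billion is spread equally over grades K-12.
TOTAL_BUDGET_MILLIONS = 6000.0
NUM_GRADES = 13  # kindergarten plus grades 1-12

# (label, number of grades, assumed share of instructional time for reading)
GRADE_BANDS = [
    ("K", 1, 0.20),
    ("1-3", 3, 0.31),
    ("4-6", 3, 0.20),
    ("7-12", 6, 0.15),
]


def reading_costs(total=TOTAL_BUDGET_MILLIONS, bands=GRADE_BANDS):
    """Return the estimated reading cost (in millions) for each grade band."""
    per_grade = total / NUM_GRADES
    return {label: n * per_grade * share for label, n, share in bands}


costs = reading_costs()
for label, cost in costs.items():
    print(f"{label:>5}: ${cost:8.1f} million")
print(f"Total: ${sum(costs.values()):8.1f} million")
```

Changing the shares in `GRADE_BANDS` shows how sensitive the total is to the assumed proportion of time spent on reading in each band.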



The accuracy of this estimate may be subject to some question.
For example, it might be argued that less time is actually 
of educational quality. This emphasis on inputs rather than outcomes has
been summarized in the 1972 General Information Yearbook published
by the National Assessment of Educational Progress:

The only available measures of educational quality 
resulting from this investment [of billions of 
dollars] had been based upon inputs into the educational
system such as teacher-student ratios, number
of classrooms, and number of dollars spent per stu- 
dent. The tenuous assumption had been that the 
quality of educational outcomes — what students 
actually learn — was directly related to the quality 
of the inputs into the educational system. No 
significant direct assessment of educational outcomes
had been made. (National Assessment of Educational
Progress, 1972, p. 1)

1. Current measures of outcomes in reading. The National
Assessment of Educational Progress, in the quotation reproduced 
above, notes that "no significant [RRI's italics] direct assess- 
ment of educational outcomes has been made." The 
outcomes of reading instruction have not been entirely ignored; 
adult literacy is surveyed periodically, and schools frequently 
administer reading achievement tests. However, as the following 
sections of the report will show, these procedures do not 
directly assess reading outcomes because they do not yield 



devoted to reading-related instruction in grades 7-12 than we 
have estimated, making the cost of reading instruction less
than $1.2 billion. On the other hand, it might be argued that, 
because of the drop-out rate in the secondary schools, the 
$6 billion costs are not distributed evenly over the grades, 
but rather that proportionately more resources are allocated 
to the elementary grades than to the secondary grades. If 
information that will permit a determination of what students
have learned as a consequence of receiving reading instruction, 
i.e., how well students actually read. 

a. Adult literacy. At the national level, adult
literacy has been used as a measure of one outcome of schooling. 
The assumption is made that the existence of a literate adult 
population is evidence that the educational system is working 
effectively. For this purpose, literacy commonly has been defined
simply in terms of years of schooling: a person is considered
literate if he or she has completed a specified number
of years of formal education. The number of years of schooling 
taken to define literacy varies among government agencies, but 
is currently in the range of five to eight years. 

This definition of literacy does not constitute 
an adequate measure of the outcomes of instruction. Knowing how
long a person has been in school does not necessarily provide 
any information about that person's reading ability. Studies 
indicate that reading ability, measured using standardized tests,



this is the case, the estimate of total reading instruction 
costs would need to be revised upwards (because of the greater 
proportion of time given to reading instruction in the elementary
grades). It might also be argued that the application of
Title I and New York State Urban Education funds to the teaching 
of reading raises the total costs of reading-related instruc- 
tion. While such arguments (or others) would alter estimates 
of reading costs, we believe that the calculation shown above 
is a conservative estimate of the annual cost of reading 
instruction. 
is frequently three to four grades below the number of years of
schooling that the persons being tested have completed. One 
study conducted in the Woodlawn area of Chicago found that, 
although more than 90% of the persons sampled had completed 
at least the sixth grade, over 50% of them proved to be func- 
tional illiterates on the basis of achievement test results 
(Hilliard, 1963). Therefore, reliance on grade-completion
criteria to define literacy provides little if any useful in- 
formation concerning the real consequences of educational pro- 
grams on the reading ability of students. 

b. Standardized test scores. School districts and
state education departments typically measure the outcomes of 
reading instruction by administering standardized, norm- 
referenced reading achievement tests to students. Agencies of 
the federal government also appear to be moving toward the use 
of performance on such tests to define literacy. However,
scores on norm-referenced tests are inadequate measures of the
outcomes of instruction because they do not provide information 
concerning either the attainment of standards of reading com- 
petence or the acquisition of particular reading skills. 

Grade norms are widely misinterpreted. It is 
widely believed that these norms define standards of reading 
competence for each grade, i.e., that they define an objectively 
determined level of performance that all children in that grade 
should be able to reach. It is not generally understood that 
a grade "norm" is defined simply by the average test score that
a sample of students of a given grade level did achieve during 
standardization of the test. Grade norms are established without
regard for the particular levels of reading competence
demonstrated by students; they depend only on the observed
distribution of test scores earned by subjects in the standardization
sample. Thus, norm-referenced scores cannot be
used to determine whether students meet performance standards 
in reading. 
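
The distinction drawn above between a norm and a standard can be made concrete with a small sketch. All scores and the cutoff below are hypothetical values chosen for illustration; the function names are assumptions and do not come from any testing program.

```python
from statistics import mean


def grade_norm(standardization_scores):
    """A grade norm is simply the average score of the standardization sample."""
    return mean(standardization_scores)


def meets_standard(score, criterion):
    """A criterion check compares a score with an externally set standard."""
    return score >= criterion


# Hypothetical raw scores (items correct out of 50) for one grade's sample.
sample = [18, 22, 25, 27, 30, 31, 33, 35, 38, 41]
norm = grade_norm(sample)

# Hypothetical performance standard: 40 items correct counts as competent.
CRITERION = 40

print(f"grade norm = {norm:.1f}")
print("student scoring at the norm meets the standard:",
      meets_standard(norm, CRITERION))
```

In this made-up sample a student scoring exactly "at grade level" falls short of the standard, because the norm is derived from the sample's scores and not from any definition of competence.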

Norm-referenced test scores are not directly
interpretable with respect to what students have learned and, 
consequently, they provide no direct indication of how well 
a student will perform on any reading tasks that may be en- 
countered in everyday life. If a twelfth-grade student obtains 
a reading score that places him at the twelfth-grade norm, no 
conclusion can be drawn concerning his ability to cope successfully
with the reading tasks that he will meet in the adult
world. All that can be inferred from this score is that his 
reading performance, compared with the performance of others 
in his age group, is about average on a particular set of test
items. Perhaps the average twelfth-grade reader can read most 
of the adult materials that he will encounter, but there is no 
inherent property of the set of test items or of the test score 
that supports this conclusion. Being average, or even above 
average, in relation to one's peers is no guarantee of competence
on specific reading tasks.

It is not surprising, then, that the quality of 
education is currently defined primarily in input terms, con- 
sidering that the common definition of literacy only takes 
account of how long a person has been in school rather than 
measuring his or her reading capability, and considering that 
norm-referenced tests only discriminate between different 
persons' performances on a non-generalizable set of tasks 
rather than providing a directly interpretable measure of 
reading skills. If the quality of education is to be defined 
in terms of outcomes, a new and different measure of reading 
ability is required. 

2. The justification for developing a new measure of
effectiveness in reading. The development of a new measure of
effectiveness in reading can be justified in several ways. 
First of all, it can be justified in terms of the need for 
documented answers to several important questions that cannot 
be addressed substantively until a new measure of reading 
ability is available. An outcome measure of reading ability 
is needed to evaluate the different methodologies used to 
teach reading (e.g., different ways of organizing curricula 
and sequencing instructional activities) in terms of their 
long-term effectiveness. Furthermore, a new measure is needed
to give concrete meaning to the phrase "equal educational 
opportunity" through a study of the ultimate consequences of
different programs in which resources are applied to overcome 
socioeconomic class differences among students. 

The most important justification for developing a 
new effectiveness measure, however, is that it is urgently 
required to give substance to two important public processes 
in education: system accountability, and the allocation of 
resources (i.e., budget-making). 

Effectiveness measures are an essential component 
of accountability processes in any field. During the last 
decade, much has been written and said about the need for 
system accountability in education. Public discussion of the 
matter has centered on two aspects of accountability. First,
there has been a demand for demonstrated results from educational
programs. The satisfaction of this demand requires
measures that clearly show what students have learned as a 
consequence of receiving instruction. Second, the public has 
asked educational professionals to affirm with their constitu- 
ents the specific educational objectives they have chosen to 
pursue and the means they are using to reach these objectives. 
The public is especially anxious to receive explanations for 
failures. To meet these demands, there is a need to document 
the relationship between alternative programs (clearly defined 
with respect to objectives, methodology, implementation, and
so on) and their measured effectiveness. 

The need for effectiveness measures is equally 
critical in budget-making. The public budget-making process 
in education results in a series of resource allocation decisions.
There is never enough money available in education to
do all the things that educators or the public wish to do. 
Therefore, decisions must be made to spend the money (i.e., 
to allocate the available resources) on one set of educational 
programs rather than another. If such decisions are to be 
made rationally, they must be based on the expected, measurable 
effectiveness of different educational programs relative to 
their costs. 

Since the effectiveness of an educational program 
can only be judged in terms of what it has actually accom- 
plished (that is, in terms of what students have learned), 
public resource allocation processes cannot take place ratio- 
nally unless and until there are effectiveness measures availa- 
ble that provide directly interpretable data demonstrating 
what students have learned from different instructional programs. 
Furthermore, since budget-making is a public process, it would 
be desirable to present program effectiveness information in a 
form that citizens can readily understand, thus facilitating
their informed participation. 
B. The Functional Specifications for a Measure of Effectiveness
in Reading

The properties that are desired in a measure of effective- 
ness in reading constitute a set of functional specifications. 
These functional specifications are as follows:

1. The capability to measure individual reading effectiveness.
Since education is concerned with the development
of individuals, the measure must yield reliable individual 
scores of reading comprehension. 

2. The capability to measure system effectiveness. It
must be possible to aggregate the scores of individuals (by 
grade, sex, ethnicity, etc.) to determine how well the educa- 
tional system is performing for different target groups in 
different schools, districts, regions, and statewide. 

3. The capability to measure progress toward adult 
reading competence. The test must be able to measure the progress
of individuals (and groups) toward becoming competent
adult readers. 

• It must measure the ability to cope with societal 
reading requirements imposed by law, such as com- 
prehending income tax forms or drivers' license 
applications, and with other materials intended 
by government agencies for the protection and 
well-being of citizens. 
• It must measure the ability to read materials
necessary to enter various vocations or profes- 
sions. 

• It must measure the ability to read materials 
that enable individuals to function competently 
in their own behalf, such as advertisements, 
insurance policies, repair manuals, etc. 

4. The capability to measure growth in reading ability.
The measure should be able to detect small changes in reading 
ability, such as might be expected to occur in one year's time. 
Measurement of group growth is an essential requirement of the 
measure. Measurement of individual growth, if feasible, is
highly desirable.

5. The capability to measure reading ability over the
entire school age range. Continuity of measurement, beginning 
in the primary grades, is necessary for measuring progress to- 
ward adult competence and for detecting growth. Therefore, 
the measure should be applicable over all or nearly all of the 
public school age range. 

6. The capability to furnish meaningful scores. Scores
on the measure should be readily and accurately understood by 
persons without technical knowledge of statistics or test con- 
struction procedures, such as parents, legislators, teachers, 
etc. Therefore, it must be possible to present scores in terms 
that are meaningful to such persons without sacrificing pre- 
cision in reporting. 
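
The second specification, aggregating individual scores to describe system performance, can be sketched as follows. The records, group labels, and the `aggregate` helper are hypothetical; a simple mean is assumed as the summary statistic, which the text does not prescribe.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical individual records: (district, grade, score).
records = [
    ("District A", 4, 52.0),
    ("District A", 4, 61.0),
    ("District A", 5, 70.0),
    ("District B", 4, 48.0),
    ("District B", 5, 66.0),
    ("District B", 5, 74.0),
]


def aggregate(records, key):
    """Average individual scores within each group defined by key(record)."""
    groups = defaultdict(list)
    for rec in records:
        groups[key(rec)].append(rec[2])
    return {group: mean(scores) for group, scores in groups.items()}


by_district = aggregate(records, key=lambda r: r[0])
by_grade = aggregate(records, key=lambda r: r[1])
statewide = mean(r[2] for r in records)

print(by_district)
print(by_grade)
print(f"statewide mean = {statewide:.2f}")
```

The same individual scores feed every level of reporting, which is why reliable individual measurement (the first specification) is a precondition for system-level results.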
C. The Minimum Number of Tasks Required to Build an Effectiveness
Measure

There is a minimum number of tasks that must be executed 
in order to construct a reading effectiveness measure. 

1. Legislative-political tasks. The task of building
an effectiveness measure logically requires a clear statement 
of the objectives that the educational system is trying to 
achieve. Therefore, it is desirable that the persons who are 
empowered to do so define the standards or expectations of 
reading competence. 

The actual setting of standards (a matter of value 
judgment) is outside the province of science; rather, it is 
the job of government. However, scientists can contribute 
sound, impartial technical work to describe adult reading 
requirements, and to define and analyze the consequences of 
alternative standards, so that government can choose among 
alternatives as rationally as possible. Since reading demands 
(and language) change over time, and since students entering
school need to be prepared to meet the reading requirements
that they will face as adults approximately 15 years later,
the analytic work carried out by scientists should include 
some amount of forecasting. 
2. Scientific-technical tasks. Whether or not formal
standards are established, the scientific and technical tasks 
to be carried out in building a measure of effectiveness in 
reading remain essentially the same. The technical task, however,
is simplified somewhat when standards have been set and
measurement need only determine whether or not those standards 
are met. These tasks are as follows: 

• To define adult reading tasks. Identify the various
kinds of materials that adults are called upon to 
read. 

• To scale adult reading tasks. It is reasonable to
assume that the number of adult reading tasks will 
be too large to test students' ability on all of 
them. The large number of adult reading tasks sug- 
gests that an approach which treats reading tasks 
individually will be less productive than one which 
scales reading tasks according to the extent to 
which they share one or more properties. Reading 
tasks with similar scale values can be clustered 
into groups. With the tasks organized or clustered 
in groups, performance on a given task would allow 
valid inferences to be made about an individual's 
performance on any other task within the same group. 
• To define "reading comprehension." At the outset of
test construction, it is necessary to define the con- 
struct "reading comprehension," i.e., to specify the 
cognitive skills to be encompassed by this construct, 
so that appropriate test items may be chosen. Fur- 
thermore, criteria of comprehension must be specified. 
These criteria define the test performance to be 
accepted as evidence that a student satisfactorily 
comprehends what he has read. 

• To carry out the technical development of the test:

- To select item formats;

- To demonstrate construct validity (that
is, having defined the construct "reading
comprehension," to demonstrate that the
tests used are valid measures of this
construct); and

- To determine test reliability (and to
develop new procedures for calculating
reliability, if needed).
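
The idea of scaling adult reading tasks and clustering those with similar scale values, described in the list above, can be illustrated with a minimal sketch. The scale values are invented for illustration, and the gap-based grouping rule is an assumption made here, not the procedure proposed in this study.

```python
def cluster_by_scale(tasks, max_gap):
    """Group tasks whose adjacent scale values differ by at most max_gap."""
    ordered = sorted(tasks.items(), key=lambda item: item[1])
    clusters = []
    for name, value in ordered:
        # Compare with the scale value of the last task in the current cluster.
        if clusters and value - clusters[-1][-1][1] <= max_gap:
            clusters[-1].append((name, value))
        else:
            clusters.append([(name, value)])
    return [[name for name, _ in cluster] for cluster in clusters]


# Hypothetical scale values (higher = harder); not taken from the study.
tasks = {
    "telephone dialing directions": 2.1,
    "housing advertisement": 2.4,
    "driver's license application": 3.8,
    "repair manual": 4.0,
    "income tax form": 5.9,
    "insurance policy": 6.2,
}

clusters = cluster_by_scale(tasks, max_gap=0.5)
for i, cluster in enumerate(clusters, start=1):
    print(f"group {i}: {', '.join(cluster)}")
```

Under such a grouping, performance on one task in a group would stand in for the others in that group, so only a sample of tasks from each group would need to be tested.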
D. Existing Approaches to the Measurement of Effectiveness in
Reading

In recent years, there have been a number of large-scale
reading projects related to the measurement of effectiveness 
in reading. In the following sections, several of the more 
important efforts will be examined and reviewed in relation 
to the functional specifications and minimum work tasks outlined 
above. 

1. The National Assessment of Educational Progress.
Reading is one of ten subject areas covered in the National 
Assessment of Educational Progress (NAEP) currently being con- 
ducted under the auspices of the Education Commission of the 
States. The purpose of NAEP is to collect census-like data on 
a nationwide basis concerning the educational achievement of 
Americans in selected content areas. The NAEP's plan calls for 
periodic retesting to detect changes in achievement. 

When the decision was made to undertake NAEP in reading, 
panels of reading specialists, educators, and test developers 
were convened to define reading objectives that would represent 
"a set of goals which are agreed upon as desirable directions 
in the education of children" (National Assessment of Educational
Progress, 1970, p. 2). The draft objectives agreed to by
the panelists were submitted to groups of lay citizens to en- 
sure that the objectives to be measured would be perceived as 
important by the public. After the reading objectives were
decided upon, professional item writers prepared test "exercises"
to measure those objectives in four age groups: 9, 13,
17, and 26-35. Each objective was measured in all age groups,
but the test items differed by age, since a decision had been
made to try to keep the median percentage of success at 50%
per objective per grade. During 1970-71, 500 test exercises 
were administered to approximately 100,000 subjects in the four 
age groups. 

1.1 NAEP's measure and the functional specifications for 
an effectiveness measure in reading. Although NAEP is an ambitious
undertaking that provides a great deal of descriptive
data about the reading achievement of students and young adults,
it does not meet all the functional specifications for an effectiveness
measure in reading.

a. Individual scores. NAEP does not provide
individual scores.

b. System scores. NAEP does not provide scores
for schools, districts, or states, though such data could 



The objectives are: to comprehend what is read; to analyze 
what is read; to use what is read; to reason logically from
what is read; and to make judgments about what is read. Another 
objective — to have attitudes about and an interest in reading — 
was agreed upon, but was not assessed at all in the first 
national assessment of reading. 

presumably be provided if needed. NAEP does provide data for
various regions of the country and by various types and sizes
of communities.

c. Measurement of progress toward adult competence.
NAEP provides no means for measuring progress toward adult com- 
petence. Since adult reading competence levels were not defined, 
progress toward such competence logically cannot be measured. 
(The reading objectives that are measured pertain to desirable 
skills that any reader should have; no distinctions are made 
between objectives for various age groups.) 

d. Measurement of growth. NAEP cannot measure
growth in reading for any individual or group, since there is
no known relationship between the difficulties of the exercises
used in the tests constructed for the different age levels.
Some exercises were administered at two or three age levels,
but only a small number were administered at all age levels. 
Since the exercises were not scaled for difficulty, differences 
in scores on different tests administered over time to the 
same students are uninterpretable with respect to growth. 

e. Applicability over age range. While NAEP covers
a wide age span, the use of different exercises that have no 
known relation to each other in tests for different age groups 
raises doubts as to whether the measurement of achievement can 
be considered continuous over the age range. 
f. Interpretability of scores. It is NAEP policy
that results be reported in a way that will be understandable 
to educators and interested citizens. Therefore, the statis- 
tical presentation is kept quite simple. However, because the 
data are primarily reported by individual "exercise" per age 
level (and for demographic subgroups within each age level),
the reader must synthesize a great deal of detailed information. 
The interpretive load on readers remains large even when the 
"exercises" are grouped by "themes" (sets of "exercises" 
clustered for reporting purposes) and by objectives. 

1.2 NAEP and the minimum work tasks required to develop 
an effectiveness measure. NAEP has carried out only some of
the minimum work tasks required to construct an effectiveness 
measure. 

a. Input for setting standards. NAEP provides no
input data for policy-makers to set standards.

b. Define adult reading tasks. NAEP does not
define adult reading tasks; instead, reading objectives are
defined as cognitive skills that any reader should have.

c. Organize or cluster reading tasks. NAEP does
not cluster reading tasks in the process of constructing tests.
However, following the administration of reading assessment
measures in 1970-71, the test items themselves were organized
for reporting purposes into clusters that "have something in
common."

d. Define the construct "reading comprehension."
NAEP does define the cognitive skills to be measured in a test 
of reading comprehension. However, criteria of comprehension
are not specified. 

e. Construct validity. The construct validity of
the measures has not been demonstrated, but their content
validity has been established through an elaborate review pro- 
cedure. 

f. Reliability. No information on reliability has
been provided yet. 

2. The ETS adult reading tasks. One goal of the Targeted
Research and Development Project in Reading, sponsored
by the U.S. Office of Education, is construction for ten-year-olds
of a criterion-referenced test that will predict competent
performance on adult reading tasks "selected to have favorable 
returns to the individual and to society in general" (Educational
Testing Service, 1971, p. 2). Educational Testing Service
(ETS) is assembling the set of adult reading tasks that
will serve as criterion for the test. 

To define representative adult reading tasks, ETS 
conducted a survey of what they have termed a "national proba- 
bility sample" of adults to learn about their daily reading 
habits. In this survey, respondents described the types of
reading done during a 24-hour period, the amount of time 
devoted to each type, and the importance of each type. Prototype
reading tasks were built to represent the main types of
reading activities reported in the survey and regarded as 
important by the respondents. The survey data and prototype 
tasks served as input to expert panels that were to study and
rank these tasks on a scale of benefits to the individual and 
society, and to suggest high-benefit reading activities that 
were inadequately represented in the set. 

The ETS plan calls for administering the tasks to a 
national sample of adults in order to determine task intercorrelations
and thus to find, through factor analysis, dimensions
of reading competence. A second sample, on which demographic
data will be collected, will be used to establish relationships
between performance on various tasks and economic,
social, and cultural status levels. The tasks that are finally
chosen on the basis of the field tests will serve as the cri- 
terion that the proposed test for ten-year-olds will eventually 
have to predict. 

Although the ETS work is not yet complete, it may be 
tentatively reviewed in relation to the functional specifica- 
tions and minimum work tasks required for an effectiveness 
measure. 

2.1 ETS' and OE's proposed measures and the functional
requirements for an effectiveness measure in reading.

a. Individual and system scores. Both types of
scores presumably could be obtained.

b. Measurement of progress toward adult competence.
Yes, but only in a limited sense, namely, whether students at 
ten years of age show satisfactory performance (yet to be defined)
on test items that predict criterion performance as
adults. The nature and extent of the relationship between 
test tasks and criterion tasks are not yet specified. 

c. Measurement of growth. There is no provision
for the measurement of growth in reading competence.

d. Applicability over age range. The test is
intended only for ten-year-olds.

e. Interpretability of scores. Unknown at this time.

2.2 The status of the ETS work and the minimum tasks 
required to develop an effectiveness measure.

a. Input for setting standards. Establishing links
between success on reading tasks and the educational or economic
status of adults should provide useful input for those
empowered to set standards.

b. Define adult reading tasks. This has been done
by professional test developers and panels of expert advisors,
taking into account the results of a national survey of reading
habits.

c. Organize or cluster reading tasks. Not clear.
ETS does plan to factor analyze performance on tasks to iden- 
tify underlying dimensions of reading competence. This analy- 
sis may do more for defining the construct "reading comprehen- 
sion" than for organizing the tasks themselves. 
d. Define the construct "reading comprehension."
The proposed factor analyses, c. above, should help to define
the skills to be measured in a test of comprehension. Criteria
of comprehension have not yet been specified.

e. Construct validity. Construct validation of
the criterion adult tasks is planned on a national sample.
Plans for determining the construct validity of the actual
test for ten-year-olds have not yet been reported.

f. Reliability. Not yet determined.

3. The Harris surveys of "survival" requirements in 
reading. Louis Harris and Associates have been commissioned
by the National Reading Center to conduct periodic surveys to 
determine how well adults are able to carry out reading tasks 
of the type required to "survive" in contemporary American 
society. The surveys focus on practical reading skills required 
to cope with common experiences in the lives of Americans, such 
as following directions for direct dialing of telephone calls, 
understanding employment and housing advertisements, responding
appropriately to questions on application forms, and so on. 
Test items directly measuring the ability to carry out tasks 
such as these are administered in individual interviews to a 
national sample of respondents selected to represent the 
civilian non-institutional population of the United States. 
Results are reported in the form of a composite index of
reading difficulty, calculated by weighting items according
to their difficulty. This index is to be used on a regular 
basis as a measure of functional reading problems in the United 
States (Harris and Associates, 1971). 

While the Harris surveys provide useful information 
concerning selected functional reading skills of adults in 
various demographic groups, they do not meet several important 
functional specifications for a measure of reading effectiveness.

3.1 The Harris "survival" measures and the functional 
requirements for an effectiveness measure in reading . 

a. Individual scores. Individual scores can be
provided.

b. System scores . There is no readily identifiable 
"system," other than the nation's schools as a whole. 

c . Measurement of progress toward adult competence . 
The Harris surveys are not designed to measure progress toward 
adult competence. 

d. Measurement of growth . The Harris surveys do 
not measure growth in reading achievement. 

e. Applicability across age range . The Harris 
surveys are designed only for persons 16 years old or older. 

f. Interpretability of scores . Reports of the 
percent of respondents answering various numbers of test items
correctly should be understandable to persons without techni-
cal training. However, the National Difficulty Index is not 
easily understood. 

3.2 The Harris surveys and the minimum tasks required 
to develop an effectiveness measure . The Harris surveys have 
carried out some but not all of the minimum tasks required to 
develop a measure of reading effectiveness. 

a. Input for setting standards . It is uncertain 
whether or not the Harris approach provides useful input data 
for setting standards of reading competence. 

b. Define adult reading tasks . Harris has used 
expert opinion to define a restricted set of reading tasks,
namely those considered essential for "survival." 

c. Organize or cluster reading tasks . Harris has 
not organized or clustered the reading tasks in any way. 

d. Define the construct "reading comprehension."
Uncertain.

e. Construct validity and reliability . No infor- 
mation provided. 

4. The Adult Performance Level Study. The purpose of
the Adult Performance Level Study (APLS), being conducted at
the University of Texas with the support of the United States 
Office of Education, is to define literacy operationally in 
terms of reading and other skills required to function
effectively in "areas of need" which are important for survival
in our society. Six areas (occupational knowledge, consumer
economics, health, community resources, government and law, and
transportation) were identified by means of literature reviews,
surveys of professional opinion, conferences on adult needs
with lay and professional participants, and interviews with
undereducated persons. In each area, reading (and other)^3
skills required for effective functioning were listed.
Criterion-referenced test items built to test these reading 
behaviors were validated in a nationwide study by determining 
the relationship between success on test items and several 
indicators of the economic and educational status of respondents.
Based on analyses of field test data, a revised list of adult 
performance requirements was developed. 

The APLS plans to increase the comprehensiveness of 
its coverage and conduct more extensive validation studies. 
The APLS expects the set of functional reading (and other) 
tasks that will be compiled to serve to guide the content of 
courses in adult basic education, and also to serve as a means
of assessing functional literacy. 



^3 Other skills measured are writing, speaking or listening,
computation, problem solving, and interpersonal relations.

Although the APLS research is still in process, pub-
lished reports concerning progress and research plans (Adult 
Performance Level Project Staff, 1973) have enabled RRI to 
review tentatively the extent to which APLS is likely to yield 
reading tests that meet the functional specifications for an 
effectiveness measure. 

4.1 The APLS measures and the functional requirements 
for a measure of effectiveness in reading . 

a. Individual scores . No information provided. 
However, with data reported on an item-by-item basis, there is 
no obvious basis for a meaningful summary score. 

b. Measurement of progress toward adult competence . 
APLS does not provide for the measurement of progress toward 
adult competence. 

c. Measurement of growth. APLS does not measure
growth in reading achievement. 

d. Applicability over age range . APLS is designed 
for adults only. 

e. Interpretability of scores . Uncertain at this 
time. However, the plan to report data on an item-by-item 
basis poses problems of summarizing data. 

4.2 APLS and the minimum tasks required to develop an 
effectiveness measure in reading . The APLS has carried out 
some, but not all, of the minimum work tasks required to con- 
struct an effectiveness measure. 

a. Input for setting standards. The analysis of
important functional reading skills and the relating of per-
formance on reading tasks to economic and educational status
should constitute useful input to government for setting stan-
dards.

b. Define adult reading tasks . This has been done 
through a combination of reviews of research, expert opinion, 
and surveys of adult (lay) opinion. 

c. Organize or cluster reading tasks . APLS groups 
reading tasks by "areas of need" and "objectives." The tasks 
themselves are treated individually and have not been scaled. 
Although no empirical evidence is given to support the claim, 
APLS contends that performance on particular tasks is predic- 
tive of performance in the entire "area."

d. Define the construct "reading comprehension."
The reading skills to be measured have been defined. However, 
criteria of comprehension have not been specified. 

e. Construct validity . APLS is establishing the 
construct validity of tasks by determining whether predicted 
relations are obtained between performance on reading tasks 
and the economic and educational status of respondents. 

f. Reliability. No information has been provided
yet.

5. Summary of the approaches taken to measure effec-
tiveness in reading. In Section D, four major approaches to
the measurement of reading competence were reviewed, both with
respect to how well the functional specifications for a measure
of effectiveness in reading were met, and with respect to
whether or not the minimum work tasks required to develop such
a measure were undertaken. The review showed that each ap-
proach meets some of the functional specifications and that,
in each case, some of the necessary work tasks have been com-
pleted, but that none has met or completed all of them.

None of the approaches reviewed is capable of meeting
two of the functional specifications, namely those for the
measurement of growth in reading achievement and the measure- 
ment of progress toward adult reading competence, although 
meeting these specifications is critical to the measurement 
of both individual and system reading effectiveness. With 
respect to the minimum work tasks, none of the approaches has 
yet successfully solved the problem of scaling adult reading 
tasks. Failure to organize adult reading tasks creates obvious 
difficulties in building a test that adequately samples the 
task domain, and in reporting and interpreting data. 

RRI recognizes that, since none of the projects 
reviewed set out originally to develop measures of individual 
and system reading effectiveness suitable for use over a wide 
age range, they should not be faulted for failing to do so.
The point of the review has been to show that, as innovative
and useful as these national efforts at assessment and measure-
ment are, they do not completely satisfy the requirements of 
public education for a reading effectiveness measure. 

E. The RRI Approach Toward the Development of a Measure of
Effectiveness in Reading

Under the terms of a contract with the New York State
Education Department,^4 RRI formulated design concepts that
would contribute in a significant way to setting standards of
reading competence and that would lead to the construction of
a reading effectiveness measure meeting the functional specifi- 
cations outlined earlier. Under the same contract, RRI also 
developed a plan for implementing these design concepts, i.e., 
for performing the scientific and technical work required to 
develop a new measure of reading effectiveness. 

The heart of the RRI approach, and what distinguishes it 
from other attempts to measure reading competence, is an em- 
phasis on finding ways to characterize adult reading materials 
quantitatively, and to use these quantitative properties of 
reading materials both as inputs for setting standards and as 
the basis for determining whether those standards are being 
met (i.e., for test design). RRI reasoned that, if reading 
materials can be quantitatively scaled in terms of significant 
variables, standards can be defined in terms of the scale 
values found for selected adult reading materials, and that 



Contract #065911. 
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competence can be assessed by determining a person's ability 
to read materials having those specified scale values. RRI 
further reasoned that, if ways could be found to describe all 
reading materials quantitatively, it would become possible to 
relate performance on any reading task to performance on other 
reading tasks and thereby to open the way to measure growth 
in reading competence. 

Three major design concepts for quantitatively characteri-
zing reading materials were explored. Following a detailed
evaluation, two of these, word familiarity and the readability 
of text, were found to be powerful enough to establish the 
feasibility of a single effectiveness measure that will meet 
all of the functional specifications described earlier in this 
chapter. These two design concepts and their applications are 
developed in detail in this report. 

The third concept, syntactic complexity of text, was 
judged to be potentially valuable but of doubtful practical 
utility in the near term. At present, models of the syntactic 
structure of the English language do not appear to be suffi- 
ciently developed to permit reliable scaling of the complexity 
of large samples of English text.^ Therefore, syntactic com- 
plexity was dropped as a design concept. 

^ Reviews of the pertinent literature led RRI to conclude that,
at the present time, syntactic models of English do not appear
to be sufficiently developed to permit reliable mechanical
scaling of the complexity of passages. Although sentence
structures can be reliably parsed by applying transformational
and other contemporary theories of grammar, much uncertainty
remains concerning sentence complexity.

At least part of the problem stems from the fact that the
way in which syntactic and semantic factors interact to produce
complexity for the reader is not yet understood. The scaling
of passage complexity is also restricted by the fact that syn-
tactic analysis has largely been limited to single sentences.
Consequently, the study of factors producing complexity across
sentences in connected prose has barely begun.

RRI's conclusions on these matters are supported by a paper
on the scaling of syntactic and semantic complexity prepared
for SED by Finn (1973) in an effort independent of the work
described in this report. After reviewing his own and others'
work in some detail, Finn concludes that the application of
syntactic models to written passages is years away.

Lacking formal models of passage complexity, RRI considered
the possibility of analyzing certain syntactic features of
individual sentences, and of averaging over sentences to derive
summary syntactic complexity scores for passages as Chomsky
(1971) has done. Unfortunately, any such analysis would have
to be carried out by hand by a trained grammarian, since an
adequate job cannot be done by available computer parsing pro-
grams. Chomsky (personal communication) has described this
hand-analytic procedure as ". . . cumbersome and time con-
suming, and probably not worth all the effort that it requires."
In view of the very large quantity of material that will pro-
bably need to be scaled in constructing the RRI reading effec-
tiveness measure (see Chapters IV and V), hand analysis of syn-
tax must be ruled out for practical reasons.

If an intensive programming effort were undertaken, RRI might
be able to achieve computer analysis of syntax. However, such
an intensive effort does not appear to be justified in terms
of the additional knowledge that would be gained about students'
reading ability. As the review in Chapter III will show,
syntactic factors are so entwined with readability that sepa-
rately evaluating students' ability to comprehend materials
at different levels of syntactic complexity would probably
be redundant with evaluating their ability to comprehend ma-
terials at different levels of readability.

1. Word familiarity and readability. Studies (cited
in Chapter IV) of reading test performance suggest that reading
success depends on two principal factors: knowledge of indi-
vidual word meanings; and comprehension of connected text.
RRI therefore reasoned that, if the vocabulary and the textual
characteristics of reading materials could be scaled, it should
be possible to define adult reading competence logically in
terms of the vocabulary that a reader must know and the diffi-
culty of text that he must comprehend to be able to read adult-
level materials competently.

RRI proposes that the vocabulary of reading materials
be characterized according to the familiarity of different
words to readers of the language. In this work, word famil-
iarity is defined in terms of the different frequencies of
occurrence of words in written English (frequently occurring
words are taken to be more familiar than infrequently occurring
words). Passages of text differ from one another in the pro-
portion of words of high, low, and moderate frequency that
they contain. Some passages contain many common or very famil-
iar words; others contain a large proportion of rare words.
If the word-frequency characteristics of adult reading materials
can be quantitatively described, it should be possible to define
reading competence operationally by estimating the number of 
words from different frequency bands that a person would need 
to know to be able to read those materials successfully. 
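
The idea of describing a passage by the proportion of its words that fall in different frequency bands can be sketched as follows. This is only an illustration, not the RRI procedure: the function names, the three band labels, and the count cutoffs are hypothetical, and a real scale would be built from counts over a large sample of written English.

```python
import re
from collections import Counter

# Hypothetical cutoffs: (band name, minimum corpus count), highest band first.
# The last band must have a minimum of 0 so that every word is assigned a band.
EXAMPLE_CUTOFFS = [("common", 3), ("moderate", 2), ("rare", 0)]

def tokenize(text):
    """Lowercase a text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())

def band_of(count, cutoffs):
    """Return the first band whose minimum corpus count is met."""
    for name, minimum in cutoffs:
        if count >= minimum:
            return name
    return cutoffs[-1][0]

def band_profile(passage, corpus_counts, cutoffs=EXAMPLE_CUTOFFS):
    """Return the proportion of running words in the passage per band."""
    tokens = tokenize(passage)
    if not tokens:
        return {name: 0.0 for name, _ in cutoffs}
    tally = Counter(band_of(corpus_counts.get(t, 0), cutoffs) for t in tokens)
    return {name: tally[name] / len(tokens) for name, _ in cutoffs}
```

A passage with a large share of its running words in the lowest band would, under this sketch, place greater vocabulary demands on the reader than one drawn mostly from the highest band.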

RRI further proposes that connected text be scaled 
for its comprehension difficulty or readability. Readability 
is a summary characteristic of text determined by the inter- 
action of structural and stylistic factors. These factors 
combine to make some passages of text easier (or harder) to 
comprehend than others. By scaling the readability of adult 
reading materials, the level of textual difficulty that a 
person must be able to comprehend to be able to read specified 
adult materials successfully can be determined. 

Neither the scaling of readability nor the measure-
ment of word frequencies is a new idea. To the best of RRI's
knowledge, however, they have never before been used separately 
or together as a means either for defining standards of reading 
competence or for measuring the extent to which those standards 
have been met. 

2 . Defining standards of competence . The information 
obtained from systematic, quantitative measurements of word 
familiarity and readability can be used to set standards of 
adult reading competence. The same capability that enables 
the familiarity of words occurring in any passage of text and
the difficulty of that passage to be measured quantitatively
also makes it possible to define quantitatively the skill 
levels — in terms of word knowledge and comprehension — required 
to perform any reading task. If government specifies the 
written materials that graduates of the educational system are 
expected to be able to read, and if these written materials 
are scaled for readability and word familiarity, then the 
levels of word knowledge and comprehension skill that graduates 
of the educational system must reach are operationally defined 
by the measured word frequency and readability characteristics 
of the designated materials. Thus, performance standards for 
educational processes can be established. 
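
One way to picture how scaled materials could yield a standard is sketched below. The rule used here is an assumption made for illustration: the report does not say how a standard would be derived from the scale values. The sketch assumes each designated document has a single difficulty value on a scale where higher means harder, and takes as the standard the lowest level that covers a chosen percentage of the designated documents. The function name and the `coverage_percent` parameter are hypothetical.

```python
import math

def standard_from_materials(difficulties, coverage_percent=100):
    """Return the lowest difficulty level that covers the requested
    percentage of the designated materials (higher value = harder)."""
    if not difficulties:
        raise ValueError("at least one scaled document is required")
    if not 0 < coverage_percent <= 100:
        raise ValueError("coverage_percent must be in the range 1..100")
    ordered = sorted(difficulties)
    # Number of documents that must be readable at the standard level.
    needed = math.ceil(coverage_percent * len(ordered) / 100)
    return ordered[needed - 1]
```

With `coverage_percent=100` the standard is simply the difficulty of the hardest designated document; lower values relax the standard to cover only the easier share of the materials.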

3. A new measure of reading competence . A new approach 
to the construction of an effectiveness measure in reading 
follows logically from the preceding argument that standards 
of reading competence can be defined in terms of the scaled, 
linguistic properties of reading materials. Once quantitative 
standards of reading competence are defined empirically, simple 
and direct measurements of the extent to which students have 
attained these standards can be made by administering reading 
tests consisting of items that have been scaled for the same 
linguistic properties that were used to define the standards, 
i.e., for word familiarity and readability. 

In the comprehension sections of such a test, some
passages would correspond in readability to the difficulty 
level designated as the adult standard. Other passages would 
be more and less difficult. A student's performance on pas- 
sages whose readability corresponds to the standard of adult 
difficulty is directly interpretable in terms of whether or 
not adult standards of competence have been met. In the event 
that adult standards have not been specified when the tests
are administered, performance on passages of text scaled with 
respect to readability can be used to provide an accurate 
assessment of the level of difficulty of text that a graduate 
is able to comprehend.^

^ Such information cannot be obtained from current norm-
referenced tests because the readability of passages in those
tests is not systematically varied.

The vocabulary sections of the test of reading effec-
tiveness would contain words sampled systematically from the
different word frequency bands constructed from RRI's analysis
of adult reading materials. The performance of students on
these sections of the test could be interpreted directly in
terms of their knowledge of words in each of the frequency
bands, and student performance could be evaluated, in this
way, in terms of the vocabulary required to meet adult reading
standards. In the event that standards have not been specified
when the tests are administered, a student's knowledge of words
can be compared to the vocabulary used in a wide range of adult
materials.^7

If the vocabulary and passages of text that make up 
the reading effectiveness test reflect the full range of school 
reading materials — from primers all the way to adult- level 
text — then the test instrument could be administered periodi- 
cally to monitor a student's progress toward adult reading 
competence as the student moves through school. 

The word familiarity and readability design concepts
lead not only to measurement of the progress toward and attain-
ment of adult reading competence, but also to measurement of
students' attainment of grade-level reading objectives.
Following the same logic that was used earlier to define adult
competence, grade-level objectives in reading can be defined
operationally by analyzing instructional materials to deter-
mine the expectations that are being placed on students con-
cerning knowledge of words of particular frequencies of occur-
rence and comprehension of text of given difficulty at each
grade level. With grade level objectives for word familiarity
and reading comprehension thus defined, it would be possible
to determine the extent to which students meet grade-level
expectations with the proposed measure of effectiveness in
reading, i.e., the extent to which students at different
grade levels know words of appropriate frequencies of occur-
rence and can comprehend text of suitable levels of reada-
bility.

^7 Such information cannot be obtained from current norm-
referenced vocabulary tests because the so-called "blueprints"
for such tests do not involve the systematic selection of
test words from different frequency bands.

F. The RRI Approach and the Functional Specifications for
a Measure of Effectiveness in Reading 

RRI's approach to the definition and measurement of effec- 
tiveness in reading, outlined above, will result in a measure 
of reading competence that will meet all the functional speci- 
fications listed earlier in this chapter. 

1. Individual scores. The RRI effectiveness measure
will yield word knowledge and reading comprehension scores
for each individual. Depending on whether or not standards
of competence have been established, scores may be used either
to evaluate achievement levels in relation to grade-level ob- 
jectives or adult standards of competence, or to describe the 
current achievement level of the student. 

2. System scores. The scores of individuals on the RRI
effectiveness measure can be aggregated to obtain reading
effectiveness scores for schools, districts, regions, or for
the state as a whole. System effectiveness could be examined
for various subgroups (e.g., by sex, ethnicity) by aggregating 
over appropriate individuals. 
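
The aggregation described here can be illustrated with a short sketch. It assumes, purely for illustration, that each individual record is a dictionary with a numeric "score" and some grouping fields such as "school" or "sex", and that the system score is the mean of individual scores. The report does not specify the record layout or the summary statistic, so all of these names and choices are hypothetical.

```python
from collections import defaultdict
from statistics import mean

def aggregate_scores(records, key):
    """Group individual records by the given field and return the
    mean individual score for each group."""
    groups = defaultdict(list)
    for record in records:
        groups[record[key]].append(record["score"])
    return {group: mean(scores) for group, scores in groups.items()}
```

Calling the same function with a different `key` (school, district, or a demographic field) gives the subgroup breakdowns mentioned above without rescoring any individual.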

3. Measurement of progress toward adult competence.
The effectiveness measure can be used to determine the levels 
of skill and knowledge required to perform important adult 
reading tasks competently, e.g., to read materials required by 
law, to read materials required for entry into different 
vocations, and to read materials required to function as a
competent adult. Since readability and word familiarity can
be scaled continuously, progress toward becoming a competent
adult reader in any or all of these areas can be monitored by
periodically administering tests in which vocabulary and reada- 
bility gradually reach the level appropriate to adult tasks. 

4. Measurement of growth. The scaling of reading mate-
rials makes it possible to compare directly students' perfor-
mance on different forms of the same test or on a graduated
series of tests, thereby providing a basis for measuring
growth in reading achievement. Reliable measurement of group
growth should be obtainable using scaled materials. Reliable
measurement of individual growth depends essentially on whether 
sufficient time can be devoted to testing in narrow word- 
frequency and readability ranges (see Chapter IV) . 

5. Applicability over age range. Use of the word fre-
quency and readability design concepts as a basis for tests of
competence makes it possible, in principle, to use a common
measurement scale over the entire public school age range.
However, in the earliest grades, different reading programs
may not overlap sufficiently in vocabulary to provide a common
base for testing. Thus, the earliest grade in which a common 
measure can be used that is unbiased with respect to any par- 
ticular reading program must be determined. 

6. Interpretability of scores. The RRI effectiveness
measure will yield scores that should be readily understood. 
Reports of reading competence will be anchored directly to 
performance, rather than to how a student's reading achieve- 
ment compares with that of his peers. 

A report on reading comprehension might read, "John 
can comprehend all the materials used in his fifth grade 
reading program. A sample paragraph is enclosed illustrating 
the most difficult materials used in this program. His present 
reading skill would probably allow him to comprehend adult 
reading materials such as (examples given) . " Vocabulary test 
results can also be reported in a way that can readily be 
understood by parents, teachers, and other concerned citizens. 
A report might read, "John meets all of the vocabulary require- 
ments of his fifth grade program. Relative to what he will 
need to know as an adult, John now knows 80% of common words, 
40% of moderately familiar words, and 15% of rare words." 
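
A vocabulary report of this kind could be computed from item-level results roughly as follows. The sketch assumes each test response is recorded as a pair of the word's frequency band and whether the student answered correctly; the function name and band labels are hypothetical, and no correction for guessing is applied.

```python
def percent_known_by_band(responses):
    """Return the percentage of tested words answered correctly in each
    frequency band, given (band, correct) pairs."""
    totals = {}
    known = {}
    for band, correct in responses:
        totals[band] = totals.get(band, 0) + 1
        known[band] = known.get(band, 0) + (1 if correct else 0)
    return {band: 100.0 * known[band] / totals[band] for band in totals}
```

Because the tested words are sampled systematically within each band, the percentage for a band can be read as an estimate of the share of that band's words the student knows, which is what makes a statement such as the one quoted above possible.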



^8 Peer comparisons, e.g., percentile and stanine scores,
can be provided if needed.

G. The RRI Approach and the Minimum Work Tasks for Developing
an Effectiveness Measure in Reading 

In addition to developing the design concepts introduced 
above and a general strategy for their implementation, RRI has 
also begun, or has developed programmatic plans for, the minimum
work tasks required to produce a reading effectiveness measure. 

1. Legislative-political tasks . The RRI design concepts 
will make it possible to provide government with precise in- 
formation concerning the readability and word familiarity 
characteristics of various adult reading materials. This 
information will make it possible to define reading competence 
operationally in terms of the levels of readability that a 
student must comprehend and the word knowledge he must have
in various familiarity bands to read competently those materials
designated by government as essential or important. 

2. Scientific-technical tasks . 

• Define adult reading tasks . Alternative ways of 
defining a representative collection of adult 
reading materials were examined. A preliminary 
decision has been reached to use the domain of 
periodicals to define the range of content and 
difficulty of adult reading materials. Several 
smaller domains of practical importance have also 
been identified. 

• Cluster adult reading tasks . Representative random
samples of text from periodicals will be scaled for 
word familiarity and readability. Any specific 
reading task or set of tasks can be similarly scaled 
and related to the text drawn from the periodicals. 

• Define the construct "reading comprehension . " In 
building the reading effectiveness measure, RRI 
will define reading comprehension in terms of a
student's ability to understand a passage of text 
sufficiently well to correctly identify words that 
have been deleted from it. Thus, RRI has decided 
to measure one general comprehension factor rather 
than several distinct comprehension subskills. 
Since research indicates that comprehension sub- 
skills are highly interrelated, measurement of a 
general comprehension factor should adequately 
measure the skill(s) usually encompassed by the
construct "reading comprehension." To be credited 
with comprehending a passage, a student must cor- 
rectly answer a sufficient number of questions to 
reduce the probability that his score occurred by 
chance alone to an acceptably low level.

• Technical development of measures. Although actual
development of the measures has not yet begun,
some plans for their technical development have
been formulated.

• Item selection . A quasi-cloze format has been 
chosen for items in the comprehension section of 
the test. In this format, a word is deleted from 
text and students must select the deleted word 
from among several options provided. Response
options will be controlled for word familiarity 
and semantic plausibility. In addition, a strategy 
has been formulated for regulating the familiarity 
of response options in the vocabulary section of 
the test, so that test results can be interpreted 
unambiguously with respect to a student's knowledge 
of words in various frequency bands. 

• Demonstration of construct validity . A preliminary 
strategy has been formulated for verifying that the 
proposed items adequately measure reading comprehen- 
sion. In essence, validation would be carried out 
using factor analytic techniques to compare test 
results obtained using the RRI items with results 
obtained when other item types are used and when 
multiple comprehension subskills are measured. The
validity of the criterion established for crediting 
a student with comprehension of material at a given 
level of readability can be tested in separate
studies requiring behavioral evidence that the
student can comprehend other materials of the
same difficulty level.

• Determination of test reliability . RRI believes
that it will be necessary to employ or develop 
new procedures for determining test reliability. 
Procedures currently used for determining the 
reliability of norm-referenced tests will not be 
applicable to RRI's criterion-referenced effec- 
tiveness measure. 
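
The comprehension criterion described above (enough correct answers that the score is unlikely to have arisen by chance alone) can be illustrated with a binomial calculation. The sketch assumes that a student who is purely guessing picks each of the response options with equal probability and that items are independent. The number of items, the number of options per item, and the acceptable chance level `alpha` are hypothetical parameters that the report does not specify.

```python
from math import comb

def chance_tail(n_items, n_options, k):
    """Probability of getting k or more items right by pure guessing."""
    p = 1 / n_options
    return sum(
        comb(n_items, j) * p**j * (1 - p) ** (n_items - j)
        for j in range(k, n_items + 1)
    )

def min_correct(n_items, n_options, alpha=0.05):
    """Smallest number correct whose chance probability is at most alpha,
    or None if even a perfect score is not rare enough."""
    for k in range(n_items + 1):
        if chance_tail(n_items, n_options, k) <= alpha:
            return k
    return None
```

For example, with 10 quasi-cloze items of 4 options each and a chance level of 0.05, `min_correct(10, 4)` gives 6: five or more correct occurs by guessing about 8 percent of the time, six or more about 2 percent of the time. Very short passages may not support any criterion at all, which is why the function can return `None`.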

H. Other Applications of the Design Concepts

The principal applications of the readability and word 
familiarity design concepts are to provide information for the 
setting of standards of reading competence and to measure the 
extent to which students have met those standards. Other im- 
portant applications have also been identified. For example, 
RRI believes that the design concepts could be used to analyze 
instructional materials to determine whether the vocabulary
and readability demands that are placed on students at different 
grade levels contribute to the failure to meet adult standards 
of reading competence. This line of investigation could lead 
to recommendations for changes in the readability and vocabu- 
lary content of instructional materials. Since such recom- 
mended changes would be designed to make the readability and 
vocabulary content of instructional materials more rational, 
their implementation would increase the likelihood that students
will reach adult standards of reading competence as they pro- 
gress through the educational system. 

Chapter II
Measuring Word Familiarity 

It is well known that the words in the English lexicon, 
like those of other languages, can be classified in terms of 
their frequencies of occurrence, i.e., some words occur more 
often than others. It seems reasonable to assume that a com- 
petent adult reader must know all of the most frequently occur- 
ring words, nearly all of the next-to-most frequent words, some- 
what fewer of the words in the next lower frequency class, etc. 

Therefore, it follows that, if the frequency-of-occurrence
characteristics of words in adult materials can be scaled, and 
if students' knowledge of words in various frequency bands can 
be systematically determined, it should be possible to compare 
a student's word knowledge with the vocabulary required to
read adult materials competently. 

The idea of formally scaling words in terms of their fre- 
quencies of occurrence in order to measure students' vocabu- 
laries represents a significant departure from current practice 
in testing, but one that is needed if a new effectiveness 
measure is to be built. Tests of word knowledge are, of course,
widely used both in norm-referenced measures of reading achieve-
ment and in measures of general ability (I.Q.). However, no
direct inference concerning the scope or size of a student's
vocabulary can be drawn from such tests of word knowledge because
the words tested in norm-referenced measures are not obtained 
from a systematic sampling of the words in the lexicon. Sys- 
tematic sampling is not required in norm-referenced tests because 
the purpose of such tests is primarily to determine how many 
words (of those tested) the child knows compared with other 
students. With this purpose in mind, words in the final version
of a norm-referenced test are apt to be chosen largely for their 
power to discriminate among students. 

Therefore, currently used measures of word knowledge do not 
provide a basis for judging whether a student's knowledge of 
words is adequate to allow him to read adult materials. However, 
for the measurement of effectiveness in reading, a test is required
that permits direct inferences concerning students' progress
toward and attainment of adult standards of reading
competence. RRI believes that the concept of word frequency
or familiarity¹ can lead to the construction of a measure from
which such inferences are possible. The concept of word familiarity
permits the vocabulary characteristics of adult materials
to be scaled quantitatively, and provides a basis for building
tests to measure students' word knowledge on the same scales.

¹ In this report, the terms word frequency and word familiarity
are used interchangeably. It is intuitively reasonable to
suppose that words which occur very frequently in written
language will be very familiar to readers, while seldom used
words will be less familiar. Research supports the assumption
of such a relationship between frequency and familiarity. It
has been found, for example, that words with high frequencies
of occurrence are recognized more rapidly (Howes and Solomon,
1951), and heard more readily in noise (Postman and Rosenzweig,
1957) than words with low frequencies of occurrence. It has
also been found that reading rates are faster for more
frequent words than they are for less frequent words (Pierce
and Karlin, 1956). In short, experimental subjects behave as
though they are, in fact, more familiar with words having a
high frequency of occurrence than they are with words that
occur less often.

A. Scales of Word Frequency

1. The need to scale word frequencies formally. The
frequency of words must be formally scaled. While the notion 
of word familiarity is easily understood in a rough, intuitive 
way, it is quite another matter to apply it with any precision. 
Everyone, for example, will agree that the word arachnid is much
less familiar than the word house; but it is not so clear how
the comparison would go between more closely matched pairs of
words like philanthropy and extrapolation, or table and book.
To make objective rankings of the familiarity of closely
matched words, a numerical scale of word familiarity is required.

Such scales can be built by drawing a large sample 
of written material and observing the number of times particular 
words occur in it. The result of dividing the number of occur- 
rences of a particular word by the total number of words in the 
sample gives the observed frequency of occurrence of that word, 
which may also be taken to define its familiarity. For example,
suppose that in a sample of 1,000,000 words, the word water
occurs 1,500 times. Then the observed frequency (and the familiarity)
of the word water in that sample would be calculated as:²

    1,500 / 1,000,000 = 0.0015

² An extremely common word has been chosen for this example;
most words have very much lower frequencies. To avoid the
nuisance of working with very small numbers, it will probably
be desirable to alter the definition somewhat, say, by using
a logarithmic scale or some such device.
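
The calculation above, together with the kind of logarithmic rescaling the note mentions, can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the choice of "base-10 logarithm of occurrences per million" is an assumption, since the report leaves the exact form of the scale open.

```python
import math


def observed_frequency(occurrences, sample_size):
    """Observed frequency: occurrences of a word divided by total words sampled."""
    return occurrences / sample_size


def log_per_million(occurrences, sample_size):
    """Hypothetical rescaling: base-10 log of occurrences per million words.

    The report only says that 'a logarithmic scale or some such device'
    may be used; this particular definition is an assumption.
    """
    per_million = occurrences * 1_000_000 / sample_size
    return math.log10(per_million)


# The 'water' example from the text: 1,500 occurrences in 1,000,000 words.
print(observed_frequency(1_500, 1_000_000))
print(log_per_million(1_500, 1_000_000))
```

On such a scale a word occurring once per million words would score 0, and each additional unit would correspond to a tenfold increase in frequency.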
2. Existing scales of word frequency. A number of word
frequency scales, based on word counts of samples of the English
language, have been developed. The best known and most extensive
of these is The Teacher's Word Book of 30,000 Words
(Thorndike and Lorge, 1944), which is based on a count of over
20 million words from a variety of printed sources. A more
recent count has been made by Kucera and Francis (1967), but
it is based on only one million words.

Counts based on specialized materials have also been 
prepared. These include Horn's (1926) count of five million words 
in personal and business correspondence, Rinsland's (1945) count 
of six million words in children's compositions, Howes' (1966) 
count of 250,000 words of adult spoken English and, most recently, 
the American Heritage Intermediate Corpus (Carroll, Richman, and
Davies, 1971) of five million words in instructional materials 
used in grades three to nine. All the existing scales, however, 
are based on samples that are too small to allow the precision 
of measurement required to draw accurate inferences concerning 
word knowledge in various frequency bands, and to detect growth.

3. The need for an enormous word sample. The vast majority
of English words have extremely low frequencies. To obtain a
reasonably accurate estimate of these frequencies, an enormous
sample is required. The need for large samples can be illustrated
by the following data from the American Heritage Intermediate
(AHI) corpus of five million words.

The following words (among others) each occurred
exactly five times in the AHI corpus: helm, cuticle, gossamer,
boredom, villa, grate, cutlass, stuffy, repast, debut, jocund,
gadfly, therapeutic, sabotage, euglena, decoy. The observed
frequency for each of these words is 0.000001 (once per million).
The following words (among others) each occurred exactly ten
times in the AHI corpus: esophagus, mermaid, gadget, lavish,
needy, plaintiff, lilac, vengeance, sustain, musty, belfry,
rascal. The observed frequency for each of these words is
0.000002 (twice per million).

However, it cannot be stated with certainty that words 
in the second set are more familiar than those in the first set, 
even though they exhibit twice the frequency in the AHI corpus, 
because it is not certain that the observed frequency of any of
these words (obtained from this particular sample) equals the 
true frequency of the words in the entire universe of written 
English. The small size of the sample results in uncertainty 
in true frequency estimations. For example, if a word occurs 
five times in a sample of five million words, there is a 95% 
certainty that its true frequency of occurrence lies somewhere 
between 0.00000042 and 0.00000238 (that is, 95% of the words
which occur five times in the sample have true frequencies
between these limits). For a word which occurs ten times in the
sample, the corresponding limits are 0.00000107 and 0.00000373.
The relatively small size of the sample produces "fuzziness,"
or low precision, in the estimate, and overlap between the 95%
confidence limits for the two sets of words.

Now observe the effect of increased sample size on 
the true frequency estimates of the words considered above. 
In a sample of fifty million, the 95% confidence limits for a 
word that occurs fifty times would be 0.00000075 and 0.00000133; 
those for a word occurring one hundred times would be 0.00000163
and 0.00000244. The overlap is gone, but there is still some
fuzziness. In a sample of five hundred million words, we get
much better resolution: the 95% confidence limits for a word
occurring five hundred times are 0.00000091 and 0.00000109, while
those for a word occurring a thousand times are 0.00000188 and
0.00000213.
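
The report does not say how these confidence limits were computed. The quoted figures, and those in Table 1, are consistent with a Wilson score interval using z = 2; that identification is an inference from the numbers, not something stated in the source. Under that assumption, a minimal sketch of the calculation is:

```python
import math


def wilson_limits(occurrences, sample_size, z=2.0):
    """Wilson score interval for a true frequency.

    z = 2 is assumed here as the approximate 95% normal deviate,
    because it appears to reproduce the limits quoted in the text.
    """
    j, n = occurrences, sample_size
    centre = j + z * z / 2
    half_width = z * math.sqrt(j * (n - j) / n + z * z / 4)
    denominator = n + z * z
    return (centre - half_width) / denominator, (centre + half_width) / denominator


# Same observed frequency (one and two per million) at three sample sizes.
for j, n in [(5, 5_000_000), (10, 5_000_000),
             (50, 50_000_000), (100, 50_000_000),
             (500, 500_000_000), (1_000, 500_000_000)]:
    low, high = wilson_limits(j, n)
    print(f"{j:>5} occurrences in {n:>11,} words: {low:.8f} to {high:.8f}")
```

When the number of occurrences J is tiny compared with N, the limits reduce to (J + 2 ± 2√(J + 1)) / (N + 4), which is the form in which Table 1 is expressed.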

No matter how large the sample is (within practical 
limits), there will always remain a rather large class of words
for which only a crude estimate of true frequency will be possible.
These words are the so-called hapax legomena, the very
rare words which occur only once or twice even in an extremely
large sample. For such words it is impossible to obtain good
resolution of their true frequency of occurrence, as shown
in Table 1.

Table 1

95% Confidence Limits for True Frequencies of Words
Occurring 1, 2, 3, and 4 Times in a Sample of Size N
(where N is assumed to be very large).

No. of           95% Confidence Limits for True Frequency
Occurrences      Lower              Upper

1                0.172/(N+4)        5.828/(N+4)
2                0.536/(N+4)        7.464/(N+4)
3                1.000/(N+4)        9.000/(N+4)
4                1.528/(N+4)        10.472/(N+4)

There will also be a sizeable class of words which
do not occur at all, even in an extremely large sample. Although
these words cannot possibly be identified, the data of
Carroll et al. (1971) suggest that there may be a way to estimate
their total number. In the case of the AHI corpus of five
million words, for example, it was estimated that only about
15% of all English word types were represented.

B. Word Frequencies and the Lognormal Distribution

While RRI cannot make direct use of existing word counts 
in establishing a word familiarity scale because they are based 
on samples that are too small, such counts do provide valuable
information concerning the general shape and properties of word 
frequency distributions. 

1. Types and tokens. In describing word frequency distributions,
it will be helpful to introduce two terms commonly
used in vocabulary studies. A type is a particular word, while 
a token is a particular occurrence of a word. For example, the 
AHI corpus consists of five million tokens, representing some 
eighty-five thousand different types. The single type water 
accounts for about 7500 tokens (i.e., this word occurred 7500 
times) in the AHI corpus. 

2. The shape of word frequency distributions. Now let us
suppose that we have a sample of N tokens, representing G different
types. If a particular type occurs J times in the sample,
then the fraction J/N is the observed frequency of that type.
Of course, there may be many types having the same observed
frequency. Let G_J denote the number of types which have an
observed frequency J/N. Then the fraction G_J/G is the proportion
of types having that observed frequency. If this proportion
is plotted against the observed frequency J/N, we
will get a curve similar to that in Fig. 1.
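
The quantities N, G, J, and G_J defined above can be tabulated directly from a list of tokens. The following is a small illustrative sketch; the function name and the toy sentence are hypothetical and carry no weight beyond showing the bookkeeping.

```python
from collections import Counter


def type_distribution(tokens):
    """Return N, G, and a list of (J, J/N, G_J/G) rows for a token sample."""
    occurrences_per_type = Counter(tokens)            # J for each type
    n_tokens = sum(occurrences_per_type.values())     # N
    n_types = len(occurrences_per_type)               # G
    types_per_count = Counter(occurrences_per_type.values())  # G_J for each J
    rows = [(j, j / n_tokens, g_j / n_types)
            for j, g_j in sorted(types_per_count.items())]
    return n_tokens, n_types, rows


# Toy sample: 11 tokens representing 7 types.
sample = "the cat saw the dog and the dog saw a bird".split()
n_tokens, n_types, rows = type_distribution(sample)
print(f"N = {n_tokens} tokens, G = {n_types} types")
for j, frequency, proportion in rows:
    print(f"J = {j}: J/N = {frequency:.3f}, G_J/G = {proportion:.3f}")
```

Even in this toy sample the types occurring once are the most numerous, which is the pattern the text describes for samples of modest size.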

FIG. 1 TYPE DISTRIBUTION CURVE FOR A MODEST-SIZED SAMPLE.
(Horizontal axis: frequency J/N. Curve annotations: "large proportion
of low-frequency types" and "a few isolated high-frequency types.")

The curve shown in Fig. 1 is typical of word frequency 
data obtained from samples of modest size. It shows that, the 
lower the frequency, the more numerous the types. If the sample 
is extremely large, however, the picture changes: the curve
bends downward at the left end as a consequence of the fact that
the very rare types are outnumbered by
types of slightly higher frequency. Fig. 2 shows how the curve
would look for a very large sample. 

The reader should note that the curve in Fig. 2 has 
been distorted in order to show its shape more clearly. If the 
curve were drawn accurately to scale, the peak would be extremely 
close to the vertical axis, and the curve would be very narrow 
and sharply bent on the high-frequency side. Fig. 3 shows the
shape of the curve more accurately, but even it is distorted to 
some extent: the peak is not as close to the vertical axis, and 
the curve is not as narrow and sharply bent, as they would be 
if the curve were drawn to scale. 

From such diagrams, we may begin to develop a sound
intuition about the makeup of the English lexicon. It consists
of a small number of extremely common types, plus an enormous
number of types having very low frequencies. In the AHI sample,
for example, the ten most common types (the, of, and, a, to, in,
is, you, that, it) accounted for nearly 25% of all the tokens,
while the hundred most common types accounted for nearly 50% of
them. At the other extreme, there were more than 35,000 types
which occurred only once each, and another 12,000 which occurred
only twice each. These figures probably would be somewhat different
for a lemmatized corpus,³ but the general pattern should
be much the same.

FIG. 2 TYPE DISTRIBUTION CURVE FOR A VERY LARGE SAMPLE,
DRAWN TO DISTORTED SCALES TO SHOW GENERAL SHAPE.
(Horizontal axis: frequency J/N.)

FIG. 3 TYPE DISTRIBUTION CURVE FOR A VERY LARGE
SAMPLE, DRAWN MORE NEARLY TO SCALE THAN
FIGURE 2.
(Horizontal axis: frequency J/N.)

3. The lognormal distribution. Several investigators
(Herdan, 1960; Carroll, 1968) have found that these type 
distribution curves are matched with extraordinary precision 
and faithfulness over most of their range by what are known as 
lognormal distributions. This name is derived from the fact 
that, if a logarithmic scale rather than a linear scale is used 
for the horizontal "frequency" axis, then the type distribution 
curve assumes the familiar, symmetrical shape of the normal 
distribution. 

The fact that the lexicon is lognormally distributed
is extremely important because the lognormal model furnishes a
way to describe type distributions precisely and succinctly.
Any normal distribution is completely described by two parameters,
μ and σ (the mean and the standard deviation). The same two
parameters are sufficient to describe the corresponding lognormal
distribution (although they no longer have quite the same
meaning). In other words, the lexicon can be characterized with
precision if μ and σ can be calculated. This calculation involves
some fairly sophisticated mathematics, the details of which
are of no concern here. The interested reader will find an
outline of this calculation in Appendix A.

³ Lemmatization, discussed later in this chapter, refers to the
process of reducing a word to its dictionary form by stripping
it of affixes. For example, teach, teacher, and teaching would
all be reduced to teach and thus would be treated as one type
for purposes of counting frequencies of occurrence. No lemmatization
was carried out in constructing the AHI corpus.
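
The sense in which two parameters describe a lognormal distribution can be shown with a deliberately naive sketch. It draws synthetic "true frequencies" from a lognormal distribution with made-up parameters, takes logarithms, and recovers μ and σ as the mean and standard deviation of the logs. This is not the Appendix A calculation, which must work from a finite sample in which most rare types are unobserved; the parameter values and function name here are assumptions chosen only for illustration.

```python
import math
import random
import statistics


def fit_lognormal(frequencies):
    """Naive fit: mean and standard deviation of the log-frequencies."""
    logs = [math.log(f) for f in frequencies]
    return statistics.fmean(logs), statistics.stdev(logs)


random.seed(0)                 # fixed seed so the illustration is repeatable
TRUE_MU, TRUE_SIGMA = -12.0, 1.5   # made-up values, not estimates for English

# Synthetic frequencies for 20,000 hypothetical word types.
frequencies = [random.lognormvariate(TRUE_MU, TRUE_SIGMA) for _ in range(20_000)]

mu_hat, sigma_hat = fit_lognormal(frequencies)
print(f"estimated mu = {mu_hat:.2f}, estimated sigma = {sigma_hat:.2f}")
```

Plotted on a linear frequency axis these values pile up near zero with a long right tail; on a logarithmic axis they form the symmetrical bell shape the text describes.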

The significance of the lognormal model for the devel- 
opment of a reading effectiveness measure is that the model 
permits written materials to be described quantitatively in a 
compact form. The ability to quantify and to represent simply 
the familiarity characteristics of word samples makes it possible 
to compare easily the vocabularies in different kinds of written 
materials. Such comparisons are important in setting standards 
of adult reading competence (see Chapter V). The ability to
describe vocabulary quantitatively in a compact form also makes 
it feasible, if samples of words are properly drawn and students' 
knowledge of these words is appropriately tested, to measure 
the extent to which a student has attained the vocabulary re- 
quired to read adult materials. 

C. The Construction of a Word Familiarity Scale

The construction of a word frequency scale is straightforward,
at least in principle. It begins with the specification
and acquisition of a representative collection, or corpus, of
English reading materials. The domain of periodicals appears
to be representative of the universe of written English, and
has been selected tentatively as the domain from which a representative,
random sample will be drawn to constitute the RRI
corpus. The RRI word familiarity scale will be developed from
the words that occur in this corpus (see Chapter IV).

The corpus will be constructed by drawing a random sample 
of passages of text from different periodicals. The amount of 
text drawn from any periodical will be proportional to its cir- 
culation or press run. The representative, random sample of 
passages of text will be fed into a computer, and the occurrence 
of various word types will be counted to yield a scale of word 
familiarity. Once the corpus of periodicals has been scaled
and μ and σ have been calculated, smaller specialized corpora
(e.g., instructional materials, government publications) can 
also be scaled, and the results related to the results obtained 
for the major corpus. 
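
The rule that the amount of text drawn from a periodical is proportional to its circulation can be sketched as a quota calculation. The periodical names and circulation figures below are invented, and the largest-remainder rounding rule is an assumption made so that the quotas add up exactly; the report does not specify these details. Passages within each quota would still have to be drawn at random.

```python
def allocate_passages(circulations, total_passages):
    """Split a passage budget across periodicals in proportion to circulation.

    Fractional shares are rounded by the largest-remainder rule (an
    assumption of this sketch) so the quotas sum to total_passages.
    """
    total_circulation = sum(circulations.values())
    exact = {name: total_passages * c / total_circulation
             for name, c in circulations.items()}
    quotas = {name: int(share) for name, share in exact.items()}
    leftover = total_passages - sum(quotas.values())
    by_remainder = sorted(exact, key=lambda name: exact[name] - quotas[name],
                          reverse=True)
    for name in by_remainder[:leftover]:
        quotas[name] += 1
    return quotas


# Hypothetical periodicals and circulations.
circulations = {
    "Periodical A": 6_000_000,
    "Periodical B": 3_000_000,
    "Periodical C": 1_000_000,
}
print(allocate_passages(circulations, 1_000))
```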

1. Some problems to be resolved in counting word types.
In order to count the occurrence of word types, a set of deci- 
sions must be made concerning how certain word types will be 
treated. 

a. Inflections. Perhaps the most important decision
is that associated with grammatical inflections. For example,
should the words talk, talks, talking, talker, and talked be
regarded as distinct words having different familiarities, or
should they all be grouped together under the single word talk?
This process of classifying words into "dictionary entries" is
known as lemmatization.

Although final decisions on lemmatization have 
not yet been made, it is likely that only regular grammatical 
forms will be lemmatized. RRI is inclined to believe that ir- 
regular forms must be learned as separate vocabulary items, and 
do not inherit the same familiarity status as their roots. 
Once the decisions concerning lemmatization strategy have been 
made, it is expected that the great bulk of the lemmatization 
can be carried out economically by computer. There will be 
some need for human editing, however, to catch such mistakes 
as, say, lemmatizing stocking together with stock, or hammer 
together with ham.
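
The kind of mistake that calls for human editing is easy to reproduce with a naive suffix-stripping rule. The sketch below is hypothetical throughout (the suffix list, the tiny vocabulary, and the function name are all invented for illustration) and is not a description of the procedure RRI intends to use; it only shows why an editor-supplied exception list would be needed.

```python
SUFFIXES = ("ing", "ed", "er", "s")   # a few regular endings, for illustration


def naive_lemma(word, vocabulary, exceptions=frozenset()):
    """Strip a regular suffix if what remains is a known word.

    Words listed in `exceptions` (supplied by a human editor) are
    never reduced.
    """
    if word in exceptions:
        return word
    for suffix in SUFFIXES:
        if word.endswith(suffix):
            stem = word[:-len(suffix)]
            if stem in vocabulary:
                return stem
            # Allow for a doubled final consonant, e.g. "hamm" -> "ham".
            if len(stem) >= 2 and stem[-1] == stem[-2] and stem[:-1] in vocabulary:
                return stem[:-1]
    return word


vocabulary = {"talk", "stock", "ham"}

print([naive_lemma(w, vocabulary)
       for w in ("talk", "talks", "talking", "talker", "talked")])
print(naive_lemma("stocking", vocabulary), naive_lemma("hammer", vocabulary))

edited = {"stocking", "hammer"}       # corrections a human editor might add
print(naive_lemma("stocking", vocabulary, edited),
      naive_lemma("hammer", vocabulary, edited))
```

The first call groups the regular forms of talk correctly; the second shows the false merges named in the text; the third shows them blocked by the exception list.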

b. Compound words . Closely akin to the lemmatization 
problem is the problem associated with compound words. The word 
coffeepot, for example, occurs far less often in print than
either of its two components. Therefore, it would receive an 
unduly low familiarity rating unless some special attention is 
paid to it. It may well prove reasonable to assign to such 
compounds the same familiarity rating as their least familiar
components. Such decisions will be postponed until the data 
have been gathered. 

c. Affixes. The situation is even more puzzling in
the case of words affected by prefixes or suffixes, for these 
addenda themselves are not equally familiar. One may be willing 
to agree that a person who understands the word legal will also 
understand the word illegal, but the same agreement would not
be forthcoming in the cases of words like quasi-legal or extra-legal.
Another difficulty arises from the fact that it is
entirely possible for the affix-laden word itself to be more
familiar than the root word from which it was (presumably)
derived, e.g., uncanny, unkempt, unravel.

d. Spelling. Still another nuisance is the incidence
in English of variant spellings. It would seem reasonable to
treat minor variants (like color and colour) as if they were
identical words, but it is not so clear what should be done with
cases like jail and gaol. Similar remarks would also apply to
the deliberately misspelled words that occur in renderings of
dialect speech (or in the poetry of Ogden Nash).

e. Shortened forms. Contractions and abbreviations
also present problems. In some cases, they should probably be
classified as separate words. However, what is to be done with
the longer and rarer ones, such as shouldn't and Phila? Ought
they to be treated as separate words, or do they belong with
should and Philadelphia insofar as familiarity is concerned?

f. Homographs. Thus far, consideration has been
given to the need to put certain non-identical words into the 
same pigeonhole for the purpose of establishing their famil- 
iarities. The opposite problem must also be confronted: 
separating identically-spelled words having different meanings. 

The simplest example of this problem is furnished
by homographs, i.e., pairs of words which, though totally different,
are spelled alike: does (is doing) and does (female
deer); or entrance (doorway) and entrance (bewitch). Unless
some very special and elaborate precautions are taken, the 
computer, in its innocence, will certainly throw both members 
of such a pair into the same bin, thus arriving at a misleading 
count. 

Sometimes capitalization provides a sufficient 
clue for separation. Thus, we can distinguish Polish (the 
language) from polish (shine) and March (the month) from march
(step). But in other cases there seems to be no way to effect
the separation except by examining the context of the word each 
time it occurs in the sample. 

g. Multiple meanings. A more subtle, and much more
common, source of false collocation of words is the fact that
one and the same word may be used with two or more different
meanings. Consider, for example, the word rather in the following
two phrases:

(1) a rather tall building 

(2) I'd rather stay home. 

It would be gratuitous to assume that rather in (1) is just as 
familiar as rather in (2). For all practical purposes, the 
problem here is the same as that of outright homographs. 

From the foregoing discussion, it might appear 
that, in order to arrive at a sound word-familiarity scale, not 
only must RRI draw an immense sample of words, but also the 
context in which each of these words occurs must be scrutinized.
However, the task is not as forbidding as it may seem. 

First, only a small minority of words exhibit 
truly different multiple meanings. These words can be identi- 
fied in advance by consulting a dictionary. Only these words 
would call for examinations of context. Second, it would not 
be necessary to go through the entire sample to determine the
frequency of each sense of an ambiguous word. A few dozen cita- 
tions should be sufficient to establish a statistically stable 
pattern. To illustrate, let us take the word entrance. Suppose 
that, of the first 100 occurrences of this word in the sample, 
it is found, by examining context, that 85 have the sense "door- 
way" while 15 have the sense "bewitch." Then it is probably 
safe to assume that 85% of all occurrences of this word have the 
first sense, and 15% the second.
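
The apportionment described in this example can be written out as a short sketch. The 85/15 split comes from the text; the total of 2,000 occurrences of entrance in the full sample is a made-up figure, and the function name is hypothetical.

```python
from collections import Counter


def apportion_by_sense(total_occurrences, sampled_senses):
    """Split a word's total count among its senses.

    The split uses the proportions found in a small, hand-examined
    sample of its occurrences.
    """
    sense_counts = Counter(sampled_senses)
    examined = len(sampled_senses)
    return {sense: total_occurrences * n / examined
            for sense, n in sense_counts.items()}


# First 100 occurrences examined in context, as in the text.
examined_sample = ["doorway"] * 85 + ["bewitch"] * 15

# Hypothetical total count of 'entrance' in the whole sample.
estimates = apportion_by_sense(2_000, examined_sample)
print(estimates)
```

Each sense can then be given its own observed frequency by dividing its estimated count by the size of the whole sample.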

Finally, it is possible that the entire task can
be done by computer. The recent work of P. J. Stone and his
colleagues at Harvard has resulted in a sophisticated computer
program which reputedly can make decisions concerning which
sense of a given word is intended in a passage of text. This
program is available, and if the cost of using it is not too
great, it can be used to dispose of virtually all complications 
that arise, including the fact that words of identical appearance 
can differ in meaning depending on context. 

h. Technical terms. One further complication arises
in connection with rare words. Many of these words are highly 
specialized technical terms, and there is a tendency to use such 
words repeatedly, if at all. An example is the word lathykin,
which is the name of a special tool used in making stained-glass 
windows. If this word appears at all in the sample, it is likely 
to appear not just once but dozens of times because there is no 
other word for this tool. Thus we will tend to overestimate its 
true frequency in the universe of written English. 

There are several ways to deal with this com- 
plication. One way would be simply to delete such words from 
the corpus. This procedure, however, would require human editing 
and would introduce an undesirable element of subjectivity. 
Another way would be to let them stand, relying on the rather
low readership enjoyed by such words (i.e., the low circulation
of journals that use the word) to correct the overestimate of
their frequencies. Still another way was followed in the
analysis of the AHI corpus. It consisted of introducing, for 
each word, a measure of dispersion of usage of that word among 
the different subject-matter categories. If a word was found 
to be used only in a few subject categories, then the observed 
frequency of that word was reduced (in some cases considerably).
RRI will probably use a similar procedure, although the exact 
form of the dispersion measure has not been determined yet. 
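
Since the form of the dispersion measure is left open, the following is only one possible illustration of the idea. It scores dispersion as the normalized entropy of a word's counts across subject categories (1 when usage is spread evenly, 0 when confined to one category) and multiplies the observed frequency by that score. Both choices are assumptions of this sketch; it is not claimed to reproduce the AHI computation or the procedure RRI will adopt, and the counts are invented.

```python
import math


def dispersion(category_counts):
    """Normalized entropy of a word's usage across subject categories.

    Returns a value from 0 (one category only) to 1 (evenly spread).
    """
    total = sum(category_counts)
    if total == 0 or len(category_counts) < 2:
        return 0.0
    entropy = -sum((c / total) * math.log(c / total)
                   for c in category_counts if c > 0)
    return entropy / math.log(len(category_counts))


def adjusted_frequency(category_counts, sample_size):
    """Observed frequency scaled down by dispersion (illustrative rule only)."""
    observed = sum(category_counts) / sample_size
    return observed * dispersion(category_counts)


# Two hypothetical words, each occurring 20 times in a million-word sample.
evenly_used = [5, 5, 5, 5]       # spread over four subject categories
specialized = [19, 1, 0, 0]      # almost entirely in one category

print(adjusted_frequency(evenly_used, 1_000_000))
print(adjusted_frequency(specialized, 1_000_000))
```

A practical measure would probably temper the reduction, since this rule drives the frequency of a word found in a single category all the way to zero.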

D. A Familiarity-Based Vocabulary Measure

The construction of a measure of students' word knowledge 
in various frequency bands should be relatively straightforward
once a word-familiarity scale has been established. Since a
profile of a student's word knowledge in several frequency bands 
is to be provided, test words should not be chosen at random 
from many different parts of the familiarity scale. Rather, 
they should be chosen carefully from a few specified, narrow 
intervals of the scale.

From a student's performance on the test words in a par- 
ticular frequency interval, it should be possible to draw a 
valid inference concerning the student's knowledge of all the
types belonging to that frequency interval. For example, 
suppose a subject gets 15 right out of 20 words tested in inter- 
val A. Then it can be stated with 95% confidence that he knows 
between 53% and 89% of the types belonging to A. Or, it can be
stated with 95% confidence that he knows at least 57% of those 
types. If more than 20 items are tested in interval A, the 
precision of the estimates can be improved. For example, if the 
number of test words from interval A is raised to 100, and if
the subject gets 75 of these right, then it can be stated with 
95% confidence that he knows between 65% and 83% of the words 
belonging to interval A, or that he knows at least 67% of these 
words. Assuming that the number of test items remains the same, 
such precise conclusions would not be possible if the selection
of test words had not been restricted to a narrow interval on 
the word familiarity scale. 
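
The percentages quoted in this example are consistent with a Wilson score interval, using z = 2 for the two-sided limits and z = 1.645 for the one-sided "at least" statement. The report does not name its method, so this identification is an inference from the figures. A minimal sketch under that assumption:

```python
import math


def wilson_interval(correct, tested, z=2.0):
    """Two-sided Wilson score limits for the proportion of words known."""
    centre = correct + z * z / 2
    half_width = z * math.sqrt(correct * (tested - correct) / tested + z * z / 4)
    denominator = tested + z * z
    return (centre - half_width) / denominator, (centre + half_width) / denominator


def wilson_lower_bound(correct, tested, z=1.645):
    """One-sided lower limit (z = 1.645 assumed for 95% confidence)."""
    return wilson_interval(correct, tested, z)[0]


for correct, tested in [(15, 20), (75, 100)]:
    low, high = wilson_interval(correct, tested)
    at_least = wilson_lower_bound(correct, tested)
    print(f"{correct}/{tested}: between {low:.0%} and {high:.0%}, "
          f"or at least {at_least:.0%}")
```

Raising the number of items from 20 to 100 at the same 75% score narrows the two-sided interval from about 36 percentage points to about 18, which is the gain in precision the text describes.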

By testing a student's knowledge of words in each of several
narrow frequency bands, several different scores can be obtained.
These separate scores would form a profile that would describe
the student's vocabulary, relative to what is required to read 
adult materials competently, with far more precision than any 
single score could hope to do. 

Chapter III
Measuring the Readability 
of English Text 
A. Problem and Background 

It is a commonplace observation that some passages of 
written material are harder to understand than others. This 
chapter deals with design concepts for scaling this variability
in passage difficulty, or readability. As the term
is generally used, readability refers to the relative comprehensibility
of different textual materials: if Passage A
is easier to understand than Passage B, A is said to be more 
readable than B. Scaling of the readability of written 
materials is an essential prerequisite for the development 
of the proposed RRI reading effectiveness measure. 

1. Need for a measure of readability. A measure of
readability is prerequisite to the measurement of reading 
effectiveness. It provides a means of objectively quanti- 
fying standards of reading competence. Once readability has 
been scaled, levels of difficulty that characterize the 
materials a competent adult must read can be specified. By 
testing students' ability to read materials at those levels 
of difficulty, it is possible to determine whether or not 
they have met adult standards. Moreover, by administering
tests calibrated for increasing passage difficulty (from the
simplest levels up to adult levels), students' progress toward 
adult reading competence can be measured. 

Scaling the difficulty of reading materials is also 
a prerequisite to analyzing how well the short-term perfor- 
mance objectives of a reading program have been met. The 
capability to quantify the reading difficulty level of instructional
materials used in various grades and programs makes it
possible to test students on passages of difficulty comparable 
to that found in the instructional materials being used, and 
thus permits a determination to be made of whether or not the 
expected level of reading competence has been achieved. 

The designers and publishers of standardized norm- 
referenced reading tests provide no information concerning the 
readability of any of the passages of text used in the tests.
Since reading comprehension levels on norm-referenced tests 
are defined exclusively by comparing a student's performance 
with that of his peers, the precise readability of passages 
included in the tests does not really matter. The only 
significant property that these passages and questions must 
have is that they span a sufficiently wide range of difficulty 
to permit reliable ranking of children's relative reading 
comprehension skill. Average performance on a set of test 
passages by children of a given age defines the norm of 
reading achievement for that age, regardless of the particular 
characteristics of the passages or questions on which the
scores were obtained. For this reason, as noted in Chapter I, 
a reading achievement score on a norm-referenced test cannot 
be interpreted directly in terms of the kinds of material a 
child can read and comprehend. By contrast, since the speci- 
fied purpose for building the RRI reading effectiveness mea- 
sure is to be able to interpret reading test scores directly 
in terms of the kinds of materials students are able to read, 
the readability characteristics of the test passages used for 
measuring achievement must be known precisely.

2. A brief history of readability research and formula
construction. The idea of scaling the difficulty level of
reading materials is not new. Klare (1963) may be stretching 
the point when he traces interest in the comprehensibility 
of messages back to Biblical days and to the Talmudists of 
the Middle Ages. However, it is certainly true that since 
the 1920's there has been a steady stream of studies concerned
with rating the relative comprehensibility of different
reading materials and with identifying the various structural
and stylistic factors that make one passage relatively more
or less difficult to understand than another.¹



¹ The discussion of readability in this chapter is limited to
analyses of structural and stylistic variables that account
for differences in the comprehensibility of text. Subsidiary
aspects of readability such as type face and legibility, or
the extent to which the readability of a passage varies across
readers as a function of interest or experience, will not be
discussed. This limitation is imposed because the objective
of this work is to scale the language characteristics of the
reading materials themselves.

The primary purpose of readability research in education
has been to find a simple and objective way to judge the
appropriateness of different reading materials for students 
with a given level of reading ability, without going through 
the actual process of asking students to read the materials 
to determine which are too easy and which are too difficult. 
A secondary purpose, more typical of publishing and journalism 
than of education, has been to learn how to write or revise 
materials so that they meet appropriate levels of difficulty 
for specified audiences of readers. 

The desire for a quick way to judge the appropriate- 
ness of materials for particular groups of students has led 
to the construction of "formulas" that typically estimate the 
approximate grade-level reading achievement needed to compre- 
hend the materials. Many readability formulas have been pro- 
posed during the last 50 years. The total number published 
is not known, since it depends on how "formula" is defined. 
In his review, Klare (1963) lists 31 formulas, though more 
than 50 can probably be cited if less stringent rules are 
applied. Some of the better known and more widely used for- 
mulas were developed by Lorge (1939), Dale-Chall (1948), Flesch 
(1948), Gunning (1952), Farr-Jenkins-Patterson (1951), and 
Spache (1953). Simplified formulas to facilitate more rapid 
calculation of readability have recently been published by 
Fry (1968) and McLaughlin (1969). 

Most readability formulas have been built in the 
following way. The author of the formula begins by selecting 
some set of passages, known to vary in difficulty, to serve 
as his criterion scale of readability. He counts the occur- 
rence in those passages of structural or stylistic variables 
that he believes cause some passages of text to be more dif- 
ficult to comprehend than others. He then calculates the 
algebraic combination of those variables that best predicts 
the (predetermined) difficulty of the criterion scale. The 
equation giving the best prediction becomes the formula. 
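
The construction procedure described above amounts to an ordinary least-squares fit of predictor counts against criterion difficulty values. The following is a minimal sketch of that idea, not any published formula. The two predictors (average sentence length and proportion of hard words), the passage data, and the function names `fit_formula` and `predict` are all hypothetical, chosen only for illustration.

```python
def fit_formula(rows, criterion):
    """Return least-squares weights [intercept, w1, w2, ...].

    rows      -- one tuple of predictor values per criterion passage
    criterion -- the predetermined difficulty value of each passage
    """
    X = [[1.0] + [float(v) for v in row] for row in rows]
    k = len(X[0])
    # Normal equations (X'X) w = X'y, stored as an augmented matrix.
    A = [
        [sum(x[i] * x[j] for x in X) for j in range(k)]
        + [sum(x[i] * y for x, y in zip(X, criterion))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("predictors are collinear; weights are not unique")
        A[col], A[pivot] = A[pivot], A[col]
        p = A[col][col]
        A[col] = [v / p for v in A[col]]
        for r in range(k):
            if r != col:
                factor = A[r][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][k] for i in range(k)]


def predict(weights, row):
    """Apply the fitted equation to the predictor values of a new passage."""
    return weights[0] + sum(w * v for w, v in zip(weights[1:], row))


# Hypothetical criterion passages: (average sentence length, proportion of hard words).
rows = [(10, 0.10), (15, 0.20), (20, 0.15), (25, 0.40), (12, 0.30)]
# Made-up criterion values generated from a known rule so the fit can be checked.
criterion = [2.0 + 0.5 * s + 3.0 * h for s, h in rows]

weights = fit_formula(rows, criterion)
print([round(w, 3) for w in weights])
print(round(predict(weights, (18, 0.25)), 2))
```

With real criterion passages the fit would not be exact, and the correlation between predicted and criterion values would serve as the validity coefficient of the formula.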

Although most formulas are alike in that they were 
built by weighting predictor variables against a criterion 
scale of reading difficulty, they vary widely in almost every 
other significant way. They differ with respect to the factors 
used to predict readability, the criterion scale against which 
the formula was originally validated, the difficulty range to 
which the formula is applicable, the definition of comprehen- 
sion used in scoring the original criterion passages, the 
sampling method for selecting passages of text for analyses, 
the counting rules used in computation, and the units in which 
readability is expressed. 

3. Shortcomings of existing formulas. Readability 
formulas have been widely used by publishers to control or 
adjust the difficulty of instructional materials. They have 
also been extensively used by educators, who employ them to 
decide whether instructional materials are suitable for stu- 
dents who have a given level of reading ability. However, 
existing formulas may not be accurate enough to warrant the 
wide use they receive, and they certainly do not appear 
adequate to the task of building the RRI measure of reading 
effectiveness. 

One serious shortcoming of existing formulas is that 
they do not predict enough of the variance in the criterion 
scale scores that they were originally designed to predict. 
The Dale-Chall formula, which is reported (Klare, 1963; Powers, 
Sumner & Kearl, 1958) to have the highest validity coeffi- 
cient (r = .71) of any wide-range formula, is able to account 
for only about one half of the variability in reading dif- 
ficulty of the criterion passages. Other popular formulas² 
are even less powerful predictors of criterion readability. 

A second shortcoming of available formulas is that 
they are not very accurate. The Dale-Chall formula, which is 
reported to have the smallest standard error of measurement 
of any wide-range formula (Powers, Sumner & Kearl, 1958), 
has an error of measurement of .77 grades. This means that 
over 30% of the time Dale-Chall readability scores will 
deviate from "true" readability scores by more than three 
quarters of a school year. Other formulas either have larger 
errors of measurement or report none at all. 

² An exception is the Spache formula that, with a validity 
coefficient of r = .82 (Spache, 1953), accounts for about two- 
thirds of criterion readability variance. However, the Spache 
formula is suitable only for primary-grade reading materials, 
and was built using a criterion scale (publishers' grade level 
assignment of texts) which leads, for technical reasons, to 
inflated estimates of validity. 
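
The figures quoted in this discussion can be checked with a short calculation. The proportion of criterion variance accounted for is the square of the validity coefficient, and the "over 30%" statement follows if errors are assumed to be normally distributed with a standard deviation equal to the standard error of measurement. That normality assumption is ours, made only for this sketch, and the function names are hypothetical.

```python
import math


def variance_explained(r):
    """Proportion of criterion variance accounted for by a validity coefficient r."""
    return r * r


def prob_error_exceeds(threshold, standard_error):
    """Two-tailed probability that a normally distributed error exceeds the threshold."""
    z = threshold / standard_error
    return 1.0 - math.erf(z / math.sqrt(2.0))


print(round(variance_explained(0.71), 2))        # Dale-Chall: 0.5
print(round(variance_explained(0.82), 2))        # Spache: 0.67
print(round(prob_error_exceeds(0.77, 0.77), 3))  # beyond one standard error: 0.317
```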

Where only rough estimates of readability are 
required, perhaps this relatively low level of precision in 
the formulas can be tolerated, although some critics contend 
that current readability formulas do more harm than good 
because of their imprecision (Bormuth, 1966). However, the 
proposed reading effectiveness measure requires greater pre- 
cision in the scaling of passage difficulty than the available 
formulas provide, in order to be able reliably to detect 
small gains in reading achievement (such as might occur from 
the beginning to the end of a school year) and to measure 
accurately whether adult competence standards have been met. 

While the poor predictive power and large errors 
of measurement of currently available formulas constitute the 
most serious barriers to using them in building a measure of 
reading effectiveness, other practical considerations also 
make these formulas unsuitable. No single formula is appli- 
cable across the entire range of difficulty to be covered by 
the reading effectiveness measure. Moreover, even within 
their applicable ranges, formulas are more accurate over some 
ranges of difficulty than others (Chall, 1958). This is 
mainly a function of the difficulty range of the criterion 
against which the formula was originally validated, but may 
also reflect erroneous statistical assumptions in the con- 
struction of the formulas (Bormuth, 1966). Beyond the center 
of the range for which each formula was built, derived 
readability scores (arrived at by extrapolation or by an 
adjustment to the value yielded by the equation) tend to be 
so approximate as to serve no useful function for RRI's 
purposes (cf. Dale-Chall, 1948). 

Although it has been accepted practice in publishing 
and education to use different readability formulas for dif- 
ferent segments of the difficulty range, this expedient cannot 
be used for the proposed reading effectiveness measure. The 
various formulas are not sufficiently alike to warrant treating 
them as though their scores reflected one continuous scale. 
The various formulas share neither predictors, criteria, nor 
computational procedures. Since growth in reading achievement 
on the proposed reading effectiveness measure is to be measured 
on a continuous scale, extending from the primary grades to 
adult levels, it is essential that all materials be rated with 
a formula based on a common set of predictors and criteria, 
and that the formula be equally sensitive across the entire 
difficulty range. Because existing formulas do not meet these 
requirements, there seems to be no alternative but to build a 
new readability measure. 

This review of the shortcomings of existing formulas 
has been limited to technical problems with the formulas 
themselves. We have found that they do not predict enough 
criterion variance, that they have large errors of measure- 
ment, and that none is applicable over a sufficiently wide 
range of readability. Therefore we have not considered it 
necessary to question the validity of the criteria which 
existing formulas were built to predict. In the next section, 
however, dealing with selection of a criterion for a new for- 
mula, the reader will see that questions of criterion validity 
could legitimately be raised. 

B. A New Readability Formula 

In constructing a new readability formula, the two most 
important decisions that will need to be made concern the 
selection of a criterion scale of difficulty against which 
the formula will initially be validated, and the selection of 
predictor variables to be included in the formula. 

1. Criterion scale of readability. Readability research 
has focused principally on predictors and, at least until 
recently, investigators have exhibited little concern for the 
quality of the criterion that is predicted. Authors pub- 
lishing readability formulas usually have not reported on the 
reliability of their criterion measures (presumably because 
this reliability is unknown), although it is an accepted prin- 
ciple of measurement that successful prediction requires a 
reliable criterion. Some authors barely describe their 
criteria, even though proper interpretation of readability 
scores depends on a precise understanding of the criterion 
used in building the formula. 

The quality of the criterion is essential to the 
utility of the formula. However astute an investigator may 
be in selecting variables that he believes should be predic- 
tive of the difficulty of text, the ultimate ability or 
inability of his formula to predict accurately the difficulty 
of new passages will be a function of the validity and relia- 
bility of the criterion against which the formula is origi- 
nally built. Since the criterion serves as the basis for 
accepting or rejecting potential variables for inclusion in 
the formula and for determining their weights, the better the 
criterion, the better those decisions are likely to be. 

Three major kinds of criterion scales of readability 
can be identified in the research literature. These are: 
sets of passages scaled in terms of concurrent norm-referenced 
reading achievement test scores; publishers' grade level desig- 
nation for books; and passages scaled for readers' ability to 
correctly guess deleted words. Each of these procedures for 
defining criterion scales will be discussed in turn. 

a. Passages scaled against norm-referenced test 
scores. This type of criterion has been used more often than 
any other in the construction of readability formulas. Of 
this type, the most widely used set of criterion passages 
are the McCall-Crabbs graded test lessons in reading (1926), 
which were scaled in the following way. Students in grades 
three through six read 390 passages and answered multiple 
choice questions about them (seven through twelve questions 
per passage). The same students also took a standardized 
reading test. The grade placement for each passage was 
arbitrarily defined as the average reading grade level of 
students who correctly answered 75% of the questions for that 
passage. 

The McCall-Crabbs test lessons (or other 
similarly graded passages) are unsuitable as a criterion 
scale for developing the required readability formula. There 
are two principal criticisms that invalidate this type of 
approach. One problem is that the grade levels assigned to 
the passages may not directly reflect the true difficulty 
levels of the material, since the scale values assigned 
depend on students' answers to multiple choice questions 
about the passages. Because it is a relatively simple matter 
to alter the comprehension scores that students can earn on 
a passage by changing either the type of question asked or 
the response options, the difficulty of passages measured 
this way resides as much in the questions asked as in the 
text itself. It is virtually impossible to demonstrate that 
the test items (questions and response options) are of 
comparable difficulty across passages. Noncomparability of 
item difficulty over passages would obviously make some pas- 
sages appear easier (or harder) than they actually are, i.e., than 
they would be if item difficulty were controlled across all 
passages.³ 

³ This type of imprecision in the criterion may partially 
account for the relatively poor prediction of criterion scores 
obtained from most readability formulas, noted above. 

Second, the McCall-Crabbs passage scale values 
are based on relations between the percentage of questions 
answered correctly by students on that passage, and students' 
reading ability as measured by scores on a norm-referenced 
reading test. However, standardized reading test scores are 
themselves not directly interpretable with respect to what 
children can read. For one thing, norm-referenced scores are 
themselves based on multiple-choice comprehension questions 
whose difficulty is deliberately varied across passages. Thus 
norm-referenced scores reflect something other than just stu- 
dents' ability to comprehend passages of increasing difficulty. 
Using norm-referenced scores in an essentially circular way 
to define the "difficulty" level of other test passages seems 
likely to result in a criterion scale whose values do not 
correspond with precision to true differences in the inherent 
difficulty of reading materials. Thus the norm-referenced 
methodology underlying the McCall-Crabbs passages, or any 
other scales similarly constructed, makes them inappropriate 
to serve as a criterion of readability for the RRI reading 
effectiveness measure. 

b. Publishers' ratings of books. Some authors 
of readability formulas (e.g., Spache, 1953) have used the 
grade level designations given to textbooks by publishers as 
the criterion scale of difficulty against which to validate 
their formulas. Since major textbook publishers control the 
sentence length and vocabulary content of their books, espe- 
cially in the early elementary grades, formulas that include 
sentence and vocabulary variables (as almost all do) would 
be expected to be good predictors of publishers' assigned 
grade levels. Whether formulas built to predict publishers' 
ratings of books are truly good predictors of readability is 
another matter. 

At present, there is insufficient evidence to 
justify the assumption that publishers systematically increase 
the reading difficulty level of their instructional materials 
over grades. While publishers may control the number and 
familiarity of words and the lengths of sentences in their 
books to some extent during the early grades, there is evi- 
dence that they do not agree concerning which words to teach 
in which grades (Stauffer, 1966). Furthermore, Fry (1968) 
has reported that the readability of instructional materials 
changes more in some grades than in others (in an apparently 
random pattern). There is also reason to believe that the 
type and amount of control that publishers exercise over 
readability change as students get older (Spache, 1953). 

Furthermore, even if publishers were to attempt 
systematically to control readability (e.g., by using existing 
formulas), it is not certain that the resulting materials 
would in fact be scaled for comprehensibility as intended. 
Attempts to alter the readability of passages by changing the 
values of various formula components (e.g., simplifying vocabu- 
lary, shortening sentences) have not had any consistent effect 
on readers' measured ability to comprehend the altered passages 
(Klare, 1963). The failure to affect comprehensibility can 
probably not be attributed to differences between the difficulty 
of questions asked about original and altered passages, because 
questions were usually held constant over both versions of the 
passages. A more likely explanation is that current readabi- 
lity formulas fail to include enough of the important variables 
that affect the difficulty of text. 

In short, publishers' grade level designations 
for books are not a suitable criterion for building the RRI 
readability formula because there is insufficient evidence that 
these designations are based on known difficulty characteris- 
tics of the materials. A formula is needed that is firmly 
anchored to properties of the reading materials, rather than 
one built merely to reflect what publishers believe children 
ought to learn in different grades. 

c. Predicting deleted words: The cloze procedure. 
In recent years the cloze technique⁴ (Taylor, 1953) has 
attracted attention as a new means for defining criterion 
scales of readability. In the cloze technique, words are 
randomly or periodically deleted from text, and subjects 
are asked to guess the missing words. The cloze score is 
simply the percent of deleted words that are restored cor- 
rectly. If words are deleted from several passages in a 
comparable way and the passages are presented to a group of 
readers for restoration, the passages can be ranked for 
readability according to their relative cloze scores: the 
higher the score, the more readable the passage. 

⁴ The term cloze is derived from the concept of closure as 
used in Gestalt psychology. Closure refers to the human 
tendency to complete a familiar but not quite finished pattern 
(for example, to see a broken circle as whole by mentally 
closing up the gaps). 
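
The scoring and ranking just described can be sketched in a few lines. This is an illustration only: the function names, the sample words, and the sample scores are hypothetical, and treating restorations as correct when they match after ignoring case and surrounding spaces is an assumption of the sketch.

```python
def cloze_score(deleted_words, responses):
    """Percent of deleted words that a reader restored exactly."""
    if not deleted_words or len(deleted_words) != len(responses):
        raise ValueError("one response is required per deleted word")
    correct = sum(
        d.strip().lower() == r.strip().lower()
        for d, r in zip(deleted_words, responses)
    )
    return 100.0 * correct / len(deleted_words)


def rank_passages(scores_by_passage):
    """Order passage names from most to least readable by mean cloze score."""
    means = {name: sum(s) / len(s) for name, s in scores_by_passage.items()}
    return sorted(means, key=means.get, reverse=True)


# Hypothetical data: one reader restores three of four deleted words.
deleted = ["man", "very", "the", "was"]
print(cloze_score(deleted, ["man", "quite", "the", "was"]))   # 75.0

# Hypothetical per-reader scores on two passages.
print(rank_passages({"A": [40.0, 50.0], "B": [70.0, 60.0]}))  # ['B', 'A']
```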

The cloze technique represents a practical 
application to written English of research results that have 
confirmed speakers' ability to utilize the redundancies of 
language to extract information from garbled or incomplete 
messages. Language is characterized by rules that limit how 
elements (letters and words) may be combined, and by recurrent 
patterns that make some elements more probable in certain 
contexts than others. Because of these regularities, an 
element that is yet to come in a message is in some degree 
constrained by the elements that have preceded it. 

For example, in English text, the letter "q" 
almost always signals that the letter "u" will follow. On 
the level of words rather than letters, the incomplete sen- 
tence "The man felt very ____" provides a great deal of 
information about the next word to occur. A user of English 
anticipates such words as "happy," "tired," or "weak." He 
would be surprised if the next word were "chimney," "there," 
or "drink." To the extent that what is to come in a sequence 
of words or letters is constrained by what has 
preceded it, the appearance of a particular word or letter is 
predictable to some degree from the context, and the sequence 
therefore possesses some redundancy. 

The redundancy of English has been estimated 
to be 60-75% (Shannon, 1951; Garner, 1962). It is believed 
that this redundancy increases the likelihood that a message 
will be correctly received by slowing down the rate of infor- 
mation transmission, and by providing safeguards against the 
occurrence of communication failures due to accent, hand- 
writing, noise, and ambiguities inherent in the language 
itself. 

Since Shannon's (1948) initial applications of 
information theory to the study of language, a considerable 
body of evidence has been amassed indicating that users of 
English have learned to employ the redundancies of the lan- 
guage (Garner, 1962). Knowledge of these redundancies is 
demonstrated by users' ability to replace missing elements 
in a message, both at the level of letters (Chapanis, 1954; 
Miller & Friedman, 1957) and at the level of words (Aborn, 
Rubenstein & Sterling, 1959; Aborn & Rubenstein, 1958; Shepard, 
1962). 

More redundant sequences of words, by defini- 
tion, should be easier to predict than less redundant sequences. 
Taylor (1954) reasoned that the cloze technique, which tests 
subjects' ability to predict deleted words, should therefore 
provide a measure of redundancy. Taylor's (1954) research 
confirmed that cloze scores measure the redundancy present 
in text. He found that the cloze scores for deleted words 
had a rank order correlation of r = .87 with the estimated 
redundancy⁵ of those words in context. Thus, cloze scores 
can be taken as a good estimate of the relative redundancy of 
language units in a passage of English text. 

The degree of redundancy present in a passage 
of text and the readability of that text are related. The 
presence of redundancy reduces the amount of information 
transmitted in a message of fixed length. Therefore more 
redundant messages should be easier to comprehend, that is, 
more readable, than less redundant messages. Since cloze 
scores provide an estimate of the redundancy present in a 
passage of text, and since redundancy is related to readabi- 
lity, it follows that the cloze technique should be able to mea- 
sure the readability of text. 

Studies comparing cloze scores with more tradi- 
tional measures of readability confirm that cloze scores do 
measure the readability of passages. Taylor (1953) found 
that the readability rank-ordering of three passages by both 
the Flesch (1948) and Dale-Chall (1948) formulas was repro- 
duced by cloze scores. In two studies, Bormuth (1962, 1968a) 
found rank order correlations greater than r = .90 between 
the ranking of passages by cloze tests and the ranking of 
passages by multiple choice comprehension tests. 

The procedure for using cloze tests to deter- 
mine the readability of a set of passages is simple. Com- 
parable deletions are made in each passage by removing an equal 
number of words randomly or periodically (e.g., every fifth, 
seventh, or tenth word) from each passage, and by replacing each 
deleted word with a blank of standard length. Taylor (1953) 
has shown that random and periodic deletion patterns yield 
equivalent data. 

⁵ A computation of the redundancy of words in passages of 
English text by direct statistical analysis of sequences is 
presently unfeasible because of the size of the English lexi- 
con and the difficulty of determining the distributional 
uncertainty of the words that occur in it. Instead, the 
redundancy of words can be estimated by assuming that sub- 
jects' predictions of words at a given location in a sequence 
provide good estimates of the probabilities governing the 
occurrence of the predicted words in that location. Subjects 
are presented with samples of English text that are (n-1) 
words long and are asked to predict the nth word. The redun- 
dancy, R, of a predicted word is estimated by computing 

$$R = \frac{H_{\max} - H}{H_{\max}},$$

where $H$ is the uncertainty (measured in bits) of the observed 
distribution of predicted words and $H_{\max}$ is the maximum 
possible uncertainty, which is obtained when all predicted 
words occur equally often. For example, if a large number of 
subjects guessed three different words for a deleted word at 
some location in a passage of text, the maximum amount of un- 
certainty would be obtained if each word was used an equal 
number of times, i.e., each word occurred one third of the 
time. In these circumstances, the probability of occurrence 
of each word, p, would be p = .33 and, since H is given by 
$H = -\sum p \log_2 p$, $H_{\max} = 1.58$ bits. If, however, the proportions 
of times the three words were guessed were .50, .25, and .25, 
then the observed uncertainty is $H = 1.50$ bits, and, therefore, 
$R = (1.58 - 1.50)/1.58 = .05$, or 5 percent. 
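
The redundancy estimate worked through in the footnote on redundancy can be reproduced with a short calculation. This is a minimal sketch under the footnote's own definitions: uncertainty is computed from the observed proportions of guessed words, and the maximum uncertainty is that of equally frequent guesses among the distinct words offered. The function names and the sample guesses are hypothetical.

```python
import math
from collections import Counter


def uncertainty_bits(guesses):
    """H = -sum(p * log2(p)) over the observed proportions of guessed words."""
    counts = Counter(guesses)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def redundancy(guesses):
    """R = (H_max - H) / H_max, with H_max taken over the distinct guessed words."""
    distinct = len(set(guesses))
    if distinct < 2:
        # With a single distinct guess H_max is zero and the ratio is undefined;
        # the footnote does not cover this case.
        raise ValueError("at least two distinct guessed words are required")
    h_max = math.log2(distinct)
    return (h_max - uncertainty_bits(guesses)) / h_max


# Three different words guessed in proportions .50, .25, and .25.
guesses = ["happy", "happy", "tired", "weak"]
print(uncertainty_bits(guesses))      # 1.5
print(round(math.log2(3), 2))         # 1.58
print(round(redundancy(guesses), 2))  # 0.05
```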

Deletions are made without regard for the 
function or meaning of specific words. Deleting only cer- 
tain classes of words (e.g., only substantive words) is 
rejected because specified words or kinds of words may not 
occur equally often in different materials. Differences 
between passages in terms of the number of words occurring 
in different classes may themselves be a readability factor, and 
their effect can be measured only by a method that operates 
independently of the number of words of different classes 
occurring in a passage. 

The proportion of correct restorations per 
passage is the cloze score for that passage. The higher the 
cloze score, the more readable the passage.⁶ Only exact 
restorations (and obvious misspellings of exact restorations) 
are counted as correct, since Taylor (1953) has shown that 
the readability scores of passages are not affected by 
allowing credit for synonyms or by allowing partial credit 
for words that maintain the general meaning of the sentence. 

Cloze scores have several important advantages 
over comprehension questions and over publishers' assigned 
grade levels as a criterion of readability. They are highly 
reliable (Taylor, 1953; MacGinite, 1971), whereas the relia- 
bility of other criteria is generally unknown. Between- 
passage differences in cloze scores are directly attributable 
to differences between the comprehensibility of the text of 
the passages, since estimates of passage difficulty are not 
affected by characteristics of the test items, e.g., by 
wording, response options, type of questions asked, etc. 
Variability in passage difficulty that could result from 
using one set of deletions rather than another set is easily 
controlled by using several different deleted versions (for 
example, some subjects restore every fifth word beginning 
with the first word, others restore every fifth word beginning 
with the second word, etc.). 

⁶ The present discussion describes how cloze tests are used 
to measure the relative readability of several different pas- 
sages. This is done by presenting several passages to some 
reader(s) and comparing the cloze scores earned by the several 
passages. Another use of cloze tests is possible. They can 
be used as a measure of students' reading comprehension. In 
the latter case, several students take a cloze test over the 
same passage(s) and readers' scores are compared. 

Several studies have shown moderately high correlations 
between cloze scores and scores on standardized tests of 
reading (Taylor, 1957; Ruddell, 1965; Bormuth, 1965). However, 
cloze scores should not be interpreted uncritically as a 
measure of comprehension. Salzinger, Portnoy & Feldman (1962) 
have shown that it is possible to correctly restore words to 
passages that are semantic nonsense, so long as the short- 
term contextual constraints of English are present. Chapanis 
(1954) found that the extent to which subjects successfully 
predict deleted units depends on their level of language 
skill. Thus it appears that language skill, per se, plays 
some role in determining cloze scores. 
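
The use of several deleted versions of one passage, each beginning the periodic deletion at a different word, can be sketched as follows. This is an illustration under stated assumptions: the blank length of ten underscores, the function name `cloze_versions`, and the sample text are hypothetical, and words are split on whitespace so punctuation stays attached to its word.

```python
BLANK = "_" * 10  # assumed standard blank length


def cloze_versions(text, every=5):
    """Return one (deleted passage, answer list) pair per starting position.

    Version 0 deletes every nth word beginning with the first word,
    version 1 begins with the second word, and so on, so that across
    all versions each word of the passage is deleted exactly once.
    """
    words = text.split()
    versions = []
    for start in range(every):
        passage, answers = [], []
        for i, word in enumerate(words):
            if i % every == start:
                passage.append(BLANK)
                answers.append(word)
            else:
                passage.append(word)
        versions.append((" ".join(passage), answers))
    return versions


text = "one two three four five six seven eight nine ten"
versions = cloze_versions(text, every=5)
print(versions[0][1])  # ['one', 'six']
print(versions[1][1])  # ['two', 'seven']
print(versions[1][0])
```

Averaging a passage's cloze scores over all versions removes the dependence on any one set of deletions.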

Finally, cloze scores have one other important 
characteristic that is not shared by other criterion scales 
of readability: they are known to be related to learning. 
Studies by Bormuth (1968b) and by Coleman & Miller (1968) 
have shown that the amount of information acquired from 
studying a passage is a function of subjects' original cloze 
scores on that passage. Since the RRI readability formula 
will be used to construct tests designed to measure the extent 
to which students have learned to read, it is desirable that 
the criterion scale of readability used to build the reada- 
bility formula have a demonstrated relationship to learning.⁷ 

Recently, a criterion scale of readability 
covering a wide range of reading difficulty has been built 
with the cloze procedure (Miller & Coleman, 1967). Using 
college students as subjects, Miller and Coleman computed 
two cloze scores for each of 36 passages. One score (bilat- 
eral cloze) was based on the proportion of correct restora- 
tions made when subjects saw words on both sides of the 
deletions and the other score (unilateral cloze) was based on 
correct restorations when subjects saw only the words pre- 
ceding the deletion, with all subsequent words masked out. 
Using these two sets of scores, which correlated r = .93 
with each other, Miller and Coleman ranked the 36 passages 
from very easy to very difficult. 

⁷ The relationship between learning (measured by information 
gain) and passage difficulty (measured by the cloze technique) 
is reviewed more thoroughly in Chapter V. 

The validity of the Miller-Coleman scale has 
been demonstrated in a study by Aquino (1969). She found 
correlations above r = .90 between the Miller-Coleman scale 
values and two independent procedures for ranking the 
same passages. These validating procedures were word-for- 
word recall of the passages, and judges' rank-ordering of 
passage difficulty. The fact that Aquino's subjects were 
drawn from a different population than the subjects used by 
Miller and Coleman lends increased weight to these findings 
regarding the validity of the scale. 

An indirect test of the validity of the Miller- 
Coleman scale was provided by Szalay (1965). He used four 
readability formulas that had been developed using the Miller- 
Coleman scale as a criterion to predict the cloze scores 
subjects would earn on a new set of passages. Correlations 
between the actual and predicted cloze scores ranged from 
r = .83 to r = .89. The Miller-Coleman scale appears to be 
valid, since readability formulas based on it can be cross- 
validated at high levels of correlation. 

In view of the advantages of the cloze proce- 
dure over norm-referenced scales and publishers' ratings, and 
in view of the validity data presented above, it appears that 
the Miller-Coleman scale is the best available criterion for 
the development of a readability formula for use in con- 
structing the RRI reading effectiveness measure. 

2. The need for a formula to predict readability. In 
view of the preceding discussion concerning the utility of 
the cloze procedure as a means for scaling the readability of 
passages, an explanation is in order concerning why the cloze 
procedure can be used only to develop a criterion for building 
a readability formula, rather than as a procedure for directly 
scaling adult-level reading materials and instructional mate- 
rials used in the schools. In other words, why not directly 
apply the cloze procedure to scale the passages whose reada- 
bility must be determined? 

The reason that this cannot be done is practical 
rather than theoretical. It is true that direct scaling of 
passages by readers is feasible when a reasonably limited 
number of passages is to be rated. However, the quantity of 
material which may have to be rated for the reading effective- 
ness measure is so large that any direct scaling approach is, 
in effect, ruled out. A more practical alternative is to 
develop a formula for predicting the cloze scores that pas- 
sages would earn if direct scaling were carried out. With 
such a formula, the readability of any passage can be readily 
estimated without recourse to direct scaling by readers. 

3. Variables that predict readability. Development of 
a formula for predicting readability requires that the struc- 
tural and stylistic variables that discriminate between easier 
and harder passages be identified, and that the weighted com- 
bination of those variables capable of predicting criterion 
cloze scores with the greatest degree of accuracy be deter- 
mined. The readability literature provides a good basis for 
at least a first attempt at selecting predictor variables. 

Over the last 50 years, a great many variables have 
been proposed as possible indicators of reading difficulty. 
Factor analysis of those characteristics that have been the 
best predictors of reading difficulty has identified two 
major factors: vocabulary difficulty and sentence complexity 
(Brinton & Danielson, 1958; Stolurow & Newman, 1959). Of 
the two factors, vocabulary difficulty has been consistently 
the more important predictor. Thus it is not surprising that 
Klare's (1963) review shows that over half the formulas built 
to date include some type of vocabulary measure, while about 
one-third employ some measure of sentence complexity. 

3.1 Measures of vocabulary difficulty. Many measures 
of vocabulary difficulty have been used to predict readability.⁸ 
Basically, vocabulary variables that have been used fall into 
three main classes: difficulty of vocabulary defined by the 
presence or absence of words appearing on a list of "easy" 
words; difficulty of vocabulary estimated by word length; and 
difficulty of vocabulary defined by some semantic property of 
the words, such as abstractness. 

⁸ Numerous experimental results from the readability litera- 
ture will be cited to support the discussion in this and the 
following section of the report. These results appear in the 

a. "Easy" words . In the first category, vocabu- 
lary difficulty is measured by comparing each word in the 
passage against a list of supposedly easy words. Each word 
in the passage is classified as easy or difficult according 
to whether or not it appears on the list, and either the 
number or proportion of hard (or easy) words is calculated 
for the whole passage. The widely used formulas developed 
by Lorge (1948), Dale & Chall (1948), and Spache (1953), 
employ word lists in this way. 

The two lists most widely used are the Dale 
list of 769 words (Dale, 1931) and the Dale list of 3000 



form of correlations between each ot a numoer or preaxuuui. 
variables and some criterion scale of readability. In reading 
these results, the reader should bear in mind that the alge- 
braic sign of the correlation (i.e., whether the correlation 
is positive or negative) does not affect the strength of the 
relationship between the variable and the criterion of reada- 
bility. The strength of that relationship depends only on 
the magnitude of the correlation (i.e., the absolute size). 
Whether the correlation coefficient (r) is positive or nega- 
tive depends on two factors . 

First, the sign of the correlation depends on whether a 
higher value of the variable is related to more readable 
text or to less readable text. Tn some cases (e.g., proportion 
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words (Dale & Chall, 1948). The Dale list of 769 words origi- 
nally contained those words that appear in both the Inter- 
national Kindergarten Union List (1928) and in the first 1000 
words of the Thorndike-Lorge Teachers word book (Thorndike & 
Lorge, 1944). This list was updated by Stone (1956), who 
replaced 173 words with an equivalent number of words appearing 
more often in contemporary, primary grade, reading textbooks. 
The Dale 30U0-word list contains approximately 3000 words 
"known" by at least 80% of fourth graders. The list was 
compiled by simply presenting lists of words to fourth graders 
and asking them to indicate by check mark which words they 
knew. When 80% of the students tested indicated that they 
knew a word, that word was included on the "easy" list. 



of easy words), as the value of a predictor variable increases, 
readability increases (text gets easier) . In other cases 
(e.g., proportion of hard words), as the value of the pre- 
dictor variable increases, readability goes down (text gets 
harder) . 

The second factor affecting the sign of the correlation is 
the criterion scale of readability used in the research being 
reported. When the McCall-Crabbs (or similar) scale is the 
criterion, or when publishers' grade ratings of books are the 
criterion, higher scale values indicate less readable (harder) 
text. However, when cloze scores are used as the criterion 
of readability, higher scale values indicate more readable 
(easier) text. Therefore, if a variable correlates positively 
with McCall'^Crabbs scores or publishers' grade level assignment 
of books, it must correlate negatively with cloze scores. 

Estimates of vocabulary difficulty based on
the presence or absence of a passage's words on an "easy" 
word list correlate rather well with criterion scales of 
readability. Lorge (1948), Dale & Chall (1948), and 
MacGinitie & Tretiak (1969) found correlations of r = .51, 
r = .68, and r = .63, respectively, between the proportion 
of words not on Dale lists and the grade levels of the McCall- 
Crabbs test lessons. Using publishers' assigned grade levels 
for textbooks as his criterion, Spache (1953) found a cor- 
relation of r = .68 between the proportion of words not on 
the Dale 769-word list and his criterion. 

Using cloze scores as a criterion, Bormuth
(1966) obtained correlations of r = .68 and r = .64 for the
proportion of passage vocabulary appearing on the Dale 769-
word list and the Dale 3000-word list, respectively, while
MacGinitie & Tretiak (1969) found a correlation of r = -.51
for the proportion of words not on the Dale 769-word list.
(The smaller correlation coefficient of the latter study may
be due to the fact that only one of five possible deletion
sets was used to compute cloze scores.) The highest correla-
tions with criterion scores yet reported for vocabulary dif-
ficulty based on word lists are reported by Coleman (1971),
who found a correlation of r = -.91 between Miller-Coleman
scores and the ratio of words not on the Dale 3000-word list.
It is likely that the difference between the magnitude of the
correlations reported by Bormuth and those reported by Coleman,
both of whom used criteria based on cloze scores, is attrib- 
utable to differences in the range of difficulty represented 
in the criterion scales. The Bormuth passages had an approxi- 
mate readability range of grade 4.0 to grade 8.0, whereas the 
Miller-Coleman passages cover a range which appears to be at 
least twice as wide. 
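
The effect of criterion range on the size of a correlation can be illustrated with a small sketch. The numbers below are synthetic and are not taken from Bormuth, Coleman, or any other study; the function name pearson_r and the alternating error pattern are assumptions made purely for illustration. The same predictor, with the same amount of error, is correlated with difficulty once over a wide range and once over a narrow slice of that range.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Synthetic data (not from any study): a predictor that tracks passage
# difficulty with a fixed alternating error of +4 / -4 scale units.
difficulty = list(range(40))
predictor = [d + (4 if d % 2 == 0 else -4) for d in difficulty]

# Same predictor, same error, two different criterion ranges.
r_wide = pearson_r(difficulty, predictor)                    # passages 0..39
r_narrow = pearson_r(difficulty[10:20], predictor[10:20])    # passages 10..19

print(round(r_wide, 2), round(r_narrow, 2))
```

In this sketch the correlation over the wide range is much larger than the correlation over the narrow range, even though the predictor is equally accurate in both cases.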

Recently, a more sophisticated procedure for 
classifying the difficulty of vocabulary on the basis of word 
familiarity has been proposed. Elley (1969) suggests using 
word frequency rather than simply presence or absence on a
list of "easy" words. He argues that, since correlations 
which depend on a two-unit scale (e.g., presence or absence) 
are usually lower than those based on a graduated scale, a 
more refined measure of word familiarity (such as relative 
frequency of occurrence) should turn out to be an improved 
predictor of readability. He further proposes that only 
nouns be counted, on the ground that they are the least 
predictable elements in a passage and are, therefore, most 
critical to the understanding of a communication. 

Elley computed the mean noun frequency value 
for 58 passages, using frequencies calculated from counts of 
words used by children. Across five validity studies in which 
judges' ratings of passage difficulty were correlated with 
the readability ratings based on the noun frequency counts,
the average correlation was r = .90 (range r = .85 to .95).
The noun frequency count was a more powerful predictor of
difficulty judgments than were all of 11 other predictors,
which included two intact readability formulas and several
major variables used in well-known readability formulas.
Therefore, Elley's work suggests that graduated ratings of 
word familiarity may predict criterion scores more accurately 
than simple binary classification of words as easy or hard. 
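
The contrast between a binary word-list measure and a graduated frequency measure can be sketched as follows. This is a minimal illustration, not Elley's procedure: it does not restrict the count to nouns, and the word list EASY and the frequency table FREQ are tiny hypothetical stand-ins for a real list (such as a Dale list) and a real frequency count. The function names are likewise assumptions.

```python
import math
import re

def tokenize(text):
    """Lower-case the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def proportion_hard(words, easy_words):
    """Binary measure: share of tokens NOT found on the 'easy' list."""
    return sum(w not in easy_words for w in words) / len(words)

def mean_log_frequency(words, freq_per_million, floor=1.0):
    """Graduated measure: mean log10 frequency; unlisted words get `floor`."""
    logs = [math.log10(freq_per_million.get(w, floor)) for w in words]
    return sum(logs) / len(logs)

# Hypothetical stand-ins for a real word list and a real frequency table.
EASY = {"the", "cat", "sat", "on"}
FREQ = {"the": 10000.0, "cat": 100.0, "sat": 100.0, "on": 1000.0, "veranda": 1.0}

words = tokenize("The cat sat on the veranda.")
hard_share = proportion_hard(words, EASY)      # one hard word out of six
familiarity = mean_log_frequency(words, FREQ)  # higher means more familiar
```

The binary measure treats every word off the list as equally hard, whereas the graduated measure distinguishes a moderately rare word from a very rare one, which is the refinement Elley argues for.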

b. Word length. Word length has been used in
some formulas as an index of vocabulary difficulty since, on
the average, longer words tend to be less familiar (and
hence, more difficult) than shorter words. Formulas using a
word-length factor to measure vocabulary difficulty have
included such characteristics as the number of syllables
per 100 words, the proportion of monosyllabic and polysyllabic
words, and average word length in letters and syllables.
Dale & Tyler (1934) and Gray & Leary (1935) found that the
percentage of one-syllable words correlated r = .38 and r = .43,
respectively, with a criterion comprehension test. Later
studies have shown higher correlations between word length
measures and criterion scores. Flesch (1948) reported a
correlation of r = .66 between average word length in syllables
and McCall-Crabbs scale values. Bormuth (1966) found that
average word length in syllables correlated r = -.8[?] with
criterion cloze scores and that the corresponding correlation
for word length in letters was r = -.68. Coleman (1971) has
reported correlations of r = .88 and r = -.90 between Miller-
Coleman scale scores and, respectively, the number of one 
syllable words and the number of letters per word. Again, the 
larger correlations found by Coleman than by Bormuth are 
probably attributable to the greater difficulty range in 
Coleman's criterion scale. 
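
The word-length measures named above can be computed mechanically. The sketch below is illustrative only: the syllable estimate (one syllable per run of consecutive vowel letters) is a crude assumption and not the counting rule used in any of the studies cited, and the function names are hypothetical.

```python
import re

def estimate_syllables(word):
    """Crude syllable estimate: count runs of consecutive vowel letters."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def word_length_measures(words):
    """Letters per word, syllables per 100 words, share of one-syllable words."""
    n = len(words)
    syllables = [estimate_syllables(w) for w in words]
    return {
        "letters_per_word": sum(len(w) for w in words) / n,
        "syllables_per_100_words": 100 * sum(syllables) / n,
        "prop_monosyllabic": sum(s == 1 for s in syllables) / n,
    }

sample = ["the", "cat", "read", "difficulty"]
measures = word_length_measures(sample)
```

For this sample, "difficulty" is estimated at four syllables and the other three words at one each.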

c. Semantic word factors. The third type of esti-
mate of vocabulary load requires some judgment concerning the
semantic properties of the language in a passage. This 
approach is based on the assumption that, on the average, 
abstract words are harder to read and comprehend than concrete 
words. Therefore, counts have been made of many types of 
words presumed to discriminate between passages on an abstract- 
concrete continuum, including image-bearing words, sensory 
words, technical words, concrete ideas, abstract ideas, local- 
isms, simple word labels, nouns of abstraction, finite verbs, 
definite words, realistic or specific words, references of an 
energetic, forceful, or vivid nature, formal versus popular 
words, definite articles, time nouns, and interjections. 
Because of imprecise definitions of the above variables, it 
is hard to know just how much overlap there is among them. 

Although correlations as high as r = .68 have 
been reported between at least one "abstraction" variable 
(definite articles) and a readability criterion (Gillie, 1957), 
formulas using a semantic approach to measuring vocabulary
load have not been as successful, in general, in predicting 
criterion variance as have formulas using either word lists 
or word length (Klare, 1963). An apparent reason for their 
limited success is that there is little agreement as to how 
abstraction can be objectively defined. 

3.2 Measures of sentence complexity. The second major
factor affecting the readability of text is sentence com- 
plexity. Many different measures of sentence complexity have 
been tried in readability formulas. These may be grouped 
into measures of sentence length, prepositional phrase mea- 
sures, and measures of syntax. 

a. Sentence length. The most frequently used
measure of sentence complexity has been sentence length, i.e.,
the average number of words per sentence. The rationale for 
using sentence length as an indicator of difficulty is that 
longer sentences are, on the average, more complex than 
shorter ones. Correlations of r = .47 (Lorge, 1948; Dale & 
Chall, 1948), r = .52 (Flesch, 1948), r = .57 (Coleman, 1971), 
and r = -.58 (Bormuth, 1966) have been reported between sen- 
tence length in words and various criteria of readability. 
Spache's (1953) reported correlation of r = .75 between mean
sentence length and publishers' primary grade, textbook-level
assignments is probably inflated by the fact that publishers
control sentence length in primary grade texts.

Recent research by Bormuth (1966) suggests that
average sentence length, measured by counting total syllables
or letters, may prove to be an even better predictor of reading
difficulty than sentence length in words. Sentence length
in syllables and in letters correlated r = .70 and r = -.67,
respectively, with a cloze score criterion. The same research
indicates that independent clause length may be a more power-
ful predictor of readability than any of these sentence length
measures, since letters per independent clause correlated
r = -.81 with cloze scores.
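
Sentence length in words and in letters can be counted as in the sketch below. It rests on a simplifying assumption that sentences end at a period, exclamation mark, or question mark, so abbreviations such as "e.g." would be miscounted; it does not attempt to identify independent clauses, which would require syntactic analysis. The function name is hypothetical.

```python
import re

def sentence_length_measures(text):
    """Mean sentence length in words and in letters."""
    # Assumption: ., ! and ? always mark the end of a sentence.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    word_lists = [re.findall(r"[A-Za-z']+", s) for s in sentences]
    word_lists = [words for words in word_lists if words]
    n = len(word_lists)
    total_words = sum(len(words) for words in word_lists)
    total_letters = sum(ch.isalpha() for words in word_lists
                        for w in words for ch in w)
    return {
        "words_per_sentence": total_words / n,
        "letters_per_sentence": total_letters / n,
    }

lengths = sentence_length_measures("The cat sat. The dog ran away quickly.")
```

The two sentences in the usage line contain 3 and 5 words, and 9 and 20 letters, respectively.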

b. Prepositional phrases. Another measure of
sentence complexity that has been used in readability for- 
mulas has been a count of prepositions or prepositional phrases. 
Correlations between prepositional phrase measures and reada- 
bility criteria have been reported as r = .35 (Dale & Tyler, 
1934; Gray & Leary, 1935), r = .43 (Lorge, 1948), and r = -.41 
(Bormuth, 1966). 

However, there is some question as to the true 
predictive value of prepositional phrase counts in readability
formulas. The Dale-Chall formula, which differs from the Lorge
formula chiefly in its lack of a prepositional phrase variable,
is a better predictor of criterion readability than is the
Lorge formula. Moreover, MacGinitie & Tretiak (1969) found
that the relative contribution of prepositional phrases to
reading difficulty varied drastically from one sample of
McCall-Crabbs criterion passages to another. They also found
that, after sentence length and word difficulty were taken
into account (predicting 80% of the variance in Miller-Coleman
scale scores), the ratio of prepositional phrases added less
than one tenth of one percent to the prediction of criterion
scores.

c. Syntactic analyses. An early attempt to assess
directly the syntactic complexity of passages was made by
Vogel & Washburne (1928), who counted the number of simple
versus compound and complex sentences. Apparently, the power 
of these variables was not sufficient to gain them widespread 
use. 

In recent years, attempts have been made to 
develop predictor variables that would measure syntactic com- 
plexity with analytic procedures derived from theories of 
transformational grammar. One such predictor is word depth, 
which summarizes the complexity of a sentence. Word depth is 
theoretically related to the memory load imposed by sentence 
structure during generation of a sentence. Each word is 
assigned a "depth" as a function of how many structural char- 
acteristics of a sentence must be kept in mind at the time 
the word is produced. The greater the number of characteris- 
tics to be remembered, the greater the depth. Determination 
of word depth usually requires a diagram of the phrase (or 
constituent) structure of the sentence. 

The relationship between word depth and reading
difficulty has been found to be positive, though the signifi-
cance of the relationship is not yet clear. Correlations as
high as r = .78 have been reported between mean word depth
and the comprehension difficulty of passages (Bormuth, 1964).
However, word depth scores are highly correlated (r = .86)
with sentence length (Bormuth, 1966), and both Bormuth (1966)
and MacGinitie & Tretiak (1969) found that word depth is no
better a predictor of readability than is mean sentence length.

Another predictor variable derived from trans-
formational grammar is the ratio of kernels to sentences or
words. Kernels are the simplest sentence units that are
transformed to make more complex sentences. For example,
"We applauded his brilliant performance" is built up from three
kernels: "He performed"; "He was brilliant"; "We applauded him."
Sentences that contain many kernels are syntactically more 
complex than sentences that contain only one kernel. Coleman 
(1971) found a correlation of r = -.77 between cloze criterion 
scores and an indirect estimate of the number of kernels. 

Other indices of syntactic complexity derived 
from transformational grammar have been proposed, such as 
the number and type of transformations (Brown, 1967), or 
depth of subordination and deletions from deep to surface 
structure (Chomsky, 1971), but these measures have not yet
been tested as predictors of criterion readability scales. 

Finally, Ruddell (1965) found that the reada-
bility of passages for an audience of children is signifi-
cantly related to the frequency with which syntactic structures
in the passages appear in children's speech. However, Bormuth
(1966) found a correlation of only r = .13 between passage
difficulty and the frequency of structures in children's
speech. Perhaps the factor of congruence between speech pat-
terns and written syntactic structures affects passage reada-
bility only for children.

C. Construction of the RRI Readability Formula

1. Reasons for building a new formula. Earlier in
this report, it was established that the ability to measure 
the readability of text accurately is essential to the develop- 
ment of the RRI reading effectiveness measure, since the vali- 
dity of the reading comprehension section of such a test 
depends on the ability to define precisely the difficulty of 
material that students can read. However, RRI's review of 
the readability literature, summarized earlier in this chapter, 
led to the conclusion that existing formulas for scaling 
readability are not suitable for use in building the RRI 
effectiveness measure. Available formulas were built to 
predict criteria of questionable validity. In addition, they 
only account for about half of the variability in the crite- 
rion scales that they were built to predict. Their use results 
in relatively large errors of measurement, and they are limited 
in the range of difficulty to which they are applicable. Thus 
a better means for scaling text is needed than present formulas
provide.

The review of criterion scales of readability led to 
the conclusion that the cloze procedure (where subjects guess 
words that have been deleted from text) is the most reliable 
and valid way currently available for scaling readability. 
However, the very large quantities of text that probably will 
require scaling during the course of the proposed work on 
reading effectiveness make it impractical to use the cloze
procedure to scale directly the readability of all the mate-
rials.
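
For readers unfamiliar with the mechanics of the cloze procedure, the sketch below shows one common way of constructing and scoring a cloze test. The details are assumptions rather than the procedure specified in this report: every fifth word is deleted (so that five different deletion sets are possible, one for each starting offset), and a response is scored correct only when it restores the deleted word exactly. The function names are hypothetical.

```python
def make_cloze(words, n=5, offset=0, blank="_____"):
    """Delete every nth word starting at `offset`.

    Returns the test as a word list with blanks, plus an answer key
    mapping each deleted position to the original word.
    """
    test, key = [], {}
    for i, w in enumerate(words):
        if i % n == offset:
            test.append(blank)
            key[i] = w
        else:
            test.append(w)
    return test, key

def cloze_score(key, responses):
    """Proportion of deleted words restored exactly (ignoring case)."""
    hits = sum(responses.get(i, "").lower() == w.lower()
               for i, w in key.items())
    return hits / len(key)

passage = "the quick brown fox jumps over the lazy dog today".split()
test, key = make_cloze(passage, n=5, offset=0)

# One subject's (hypothetical) answers for the two blanks.
score = cloze_score(key, {0: "The", 5: "under"})
```

The passage-level cloze score used as a readability criterion would then be the mean of such scores over many subjects and, ideally, over all five deletion sets.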

The practical alternative suggested is to build a 
formula to predict readability, using cloze scores of selected 
passages as the criterion scale during construction. It is 
proposed that the 36 passages of the Miller-Coleman (1967) 
scale be used initially for this purpose. As noted earlier, 
this scale is probably the best available criterion of reada- 
bility. Unfortunately, because the Miller-Coleman bilateral 
cloze scores are based on relatively few subjects, these 
scores may not be as stable a criterion as the RRI formula 
will require. Therefore, RRI proposes to administer cloze 
tests on the 36 Miller-Coleman passages to an appropriately
large sample of readers to establish highly stable scale 
values. 

Once the improved Miller-Coleman scale values have 
been determined, the following general strategy will be used 
to build the formula. First, several variables that should 
be predictive of passage difficulty will be selected, and 
the values of these variables calculated for each passage. 
Second, the correlations of these variables with each other 
and with the criterion measure of reading difficulty will be 
calculated. Third, using this information, several different 
algebraic formulas that reproduce the improved Miller-Coleman
scores will be generated. Fourth, the best of the alternative
formulas will be identified by comparing their ability to 
predict accurately and reliably the values of the Miller- 
Coleman criterion scale after it has been expanded to include 
many more passages. Fifth, the validity of the formula will 
be verified using a new set of passages. 

2. Constraints in the construction of the formula.
The volume of material that probably will need to be scaled 
for readability in building the RRI effectiveness measure is 
so large that any human processing of the raw text, such as 
hand counting of any predictor variables, must be ruled out 
for practical reasons. RRI's choice of predictor variables 
is thus restricted to those that can be calculated on a com- 
puter. 

The requirement for machine countable predictors 
rules out, at least for the present, predictor variables such 
as word depth or transformations from deep to surface struc-
ture, since these require hand analysis by a linguist. For
the same reason, RRI cannot include stylistic variables of
the type proposed by Chomsky (1971), such as figures of speech,
unusual choices of vocabulary, unusual word orders, and unusual
sentence constructions.

While using a computer imposes constraints such as
these, it also provides an opportunity to consider many more
variables than could practicably be counted by hand. Since
data reported by Bormuth (1966) suggest that the ability to
evaluate new variables increases the chances of developing a
powerful formula, the use of a computer provides a decided
advantage.

However, in view of computer costs, RRI will proba-
bly confine its investigations to predictor variables whose
calculation involves the simplest possible computer processes
consistent with achieving adequate predictive power. While
the calculation of predictor variables during the initial
stages of the construction of the formula would be relatively
inexpensive regardless of the complexity of the computer pro-
cesses involved (since only a limited number of passages need
to be scaled), computer costs are bound to mount when analyzing
the several corpora (see Chapters IV and V) that will have to 
be examined over the course of the proposed test construction 
effort. Therefore, there are advantages to keeping the varia- 
bles in the formula as simple as possible. 

3. Stages in the construction of the formula. The con-
struction of the formula will proceed through several stages.

3.1 The generation of algebraic formula(s). In the
initial stage, what is believed to be a good set of predictor
variables will be selected and the value of each variable in
each of the 36 passages will be determined. After calculating
the correlations of the predictor variables with each other
and with the criterion, one or more regression formula(s) can
be composed to predict the improved Miller-Coleman scores.
Since many variables are available to be included in the
regression equation, it is likely that several formulas can
be constructed that will exactly predict the criterion scale
scores. During this first stage of constructing the formula,
issues of economy on the computer will be ignored, since the
maximum possible information about candidate predictors must
be obtained. 
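
The composition of a regression formula from a set of predictor values can be sketched as below. This is a generic ordinary least squares fit solved through the normal equations, offered only as an illustration of the kind of calculation involved; it is not the RRI procedure, and the predictor values and criterion scores are made up so that the true weights are known. The names fit_linear and predict are hypothetical.

```python
def fit_linear(rows, y):
    """Ordinary least squares via the normal equations (X'X) b = X'y.

    rows: one tuple of predictor values per passage
    y:    one criterion score per passage
    Returns [intercept, weight_1, weight_2, ...].
    """
    X = [[1.0] + [float(v) for v in r] for r in rows]
    k = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [[sum(x[i] * x[j] for x in X) for j in range(k)]
         + [sum(x[i] * t for x, t in zip(X, y))]
         for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        p = A[col][col]
        A[col] = [v / p for v in A[col]]
        for r in range(k):
            if r != col:
                f = A[r][col]
                A[r] = [a - f * b for a, b in zip(A[r], A[col])]
    return [A[i][k] for i in range(k)]

def predict(coeffs, row):
    """Apply a fitted formula to one passage's predictor values."""
    return coeffs[0] + sum(c * v for c, v in zip(coeffs[1:], row))

# Made-up predictor values for five passages, with a criterion constructed
# as 2 + 3*x1 - 1*x2 so that the fit can be checked against known weights.
rows = [(1, 2), (2, 1), (3, 5), (4, 3), (5, 8)]
criterion = [2 + 3 * a - b for a, b in rows]
coeffs = fit_linear(rows, criterion)
```

With as many free weights as there are passages, a formula of this kind can always reproduce the criterion scores, which is one reason the later stages test the candidate formulas on additional passages.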

a. The selection of predictor variables to be 
tested. Based on evidence of their predictive power in pre- 
vious research, the following measures of vocabulary dif- 
ficulty will be evaluated in building a first-order formula: 
presence or absence of a word on an "easy" word list, word 
length in letters, and word length in syllables. As soon as 
it is feasible to do so (i.e., once word probabilities are 
established from an analysis of a corpus of English materials),
the frequency of occurrence of words will be tested as a 
predictor variable. Sentence measures that will be tried 
because of their demonstrated predictive power in previous 
work will include sentence length in words, sentence length 
in syllables, and sentence length in letters. 



^ The number of syllables can be closely approximated by a 
count of the number of vowels, plus the letter "y" (Coke & 
Rothkopf, 1970). 
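
The approximation described in this footnote can be written in a few lines. The sketch below is a literal reading of the footnote (count every occurrence of a, e, i, o, u, and y) and is not a reproduction of the Coke & Rothkopf procedure; the function names are hypothetical. A raw vowel count overstates the syllable count for words containing vowel pairs or a silent "e", so its value lies in being cheap to compute and closely related to the true count rather than in being exact for any one word.

```python
def vowel_count(word):
    """Count the letters a, e, i, o, u and y in a word."""
    return sum(ch in "aeiouy" for ch in word.lower())

def approx_syllables_per_100_words(words):
    """Vowel-count stand-in for the 'syllables per 100 words' measure."""
    return 100 * sum(vowel_count(w) for w in words) / len(words)

count = vowel_count("readability")   # e, a, a, i, i, y
rate = approx_syllables_per_100_words(["the", "cat"])
```

For "readability" the count is six, against five spoken syllables.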

In addition to evaluating variables that have
proved to be useful in previous readability research, a 
number of other candidate variables that intuitively appear 
promising will be evaluated. The number of commas, and other 
intra-sentence punctuation marks, will be counted, since 
these should signal sentence complexity. The proportion of 
words starting with letter combinations that frequently signal 
relational words also will be counted, such as those starting 
with wh and th (excluding "the"). Because many words begin-
ning with wh and th are common function words, they tend to
be both short and on all lists of "easy" words. This fact
could lead us to underestimate the difficulty of some pas-
sages, since the wh and th words may signal greater syntactic
complexity than other words on "easy" lists of comparable
length.
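
Both of the candidate variables just described are simple to count by machine, as the following sketch suggests. The set of marks treated as intra-sentence punctuation (comma, semicolon, colon, parentheses) and the rule for ending a sentence are assumptions, as are the function names.

```python
import re

# Assumed set of intra-sentence punctuation marks.
INTRA_SENTENCE_MARKS = ",;:()"

def punctuation_per_sentence(text):
    """Mean number of intra-sentence punctuation marks per sentence."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    marks = sum(text.count(m) for m in INTRA_SENTENCE_MARKS)
    return marks / len(sentences)

def wh_th_proportion(text):
    """Share of words beginning with 'wh' or 'th', excluding 'the'."""
    words = re.findall(r"[a-z']+", text.lower())
    hits = sum(w.startswith(("wh", "th")) and w != "the" for w in words)
    return hits / len(words)

sample = "When the dog barked, the cat, which was asleep, woke."
marks_rate = punctuation_per_sentence(sample)   # three commas, one sentence
wh_th_rate = wh_th_proportion(sample)           # "when" and "which" of ten words
```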

Stylistic variables have not usually been 
included in readability formulas because of the subjectivity 
and difficulty of rating them, even though it has been noted 
by writers in the field (e.g., Klare, 1963; Bormuth, 1966) 
that the inability of formulas to take stylistic variables 
into account reduces their predictive power. At least one 
mechanically calculable feature that should reflect elements
of style will be evaluated, namely, variability in sentence
length.
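
Variability in sentence length could be expressed in several ways; the report does not say which statistic is intended. The sketch below assumes the standard deviation of sentence lengths in words, and uses the same simplified sentence-splitting rule as any mechanical count would need. The function name is hypothetical.

```python
import re
from statistics import pstdev

def sentence_length_variability(text):
    """Population standard deviation of sentence lengths, in words."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    lengths = [len(re.findall(r"[A-Za-z']+", s)) for s in sentences]
    return pstdev(lengths)

spread = sentence_length_variability("The cat sat. The dog ran away quickly.")
```

A passage written in sentences of uniform length would score zero on this measure, while one that mixes short and long sentences would score higher.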



b. The resolution of technical problems. In the
initial stage of formula development, problems associated 
with a possible combination of linear and nonlinear predictors 
must be resolved. Virtually every existing readability for- 
mula has assumed a linear relationship between all predictor 
variables and readability; yet, as shown in Bormuth's (1966) 
work and elsewhere, this assumption is probably not true. 
Thus linear correlation models may be unsuitable for deter- 
mining the relationship of predictor variables to each other 
and to the criterion. 

Problems associated with establishing the alge- 
braic form of the formula also must be analyzed and solved. 
Virtually all readability formulas add the weighted values 
of predictor variables. However, it seems intuitively plau- 
sible that their product (or perhaps some weighted geometric 
mean) may prove to be a better predictor. 
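
The two algebraic forms contrasted here can be written side by side. The sketch below is illustrative only; the function names, weights, and values are hypothetical and carry no empirical meaning.

```python
import math

def additive_score(intercept, weights, values):
    """Conventional form: intercept plus a weighted sum of predictors."""
    return intercept + sum(w * v for w, v in zip(weights, values))

def geometric_score(weights, values):
    """Alternative form: weighted geometric mean of positive predictors."""
    total = sum(weights)
    log_mean = sum(w * math.log(v) for w, v in zip(weights, values)) / total
    return math.exp(log_mean)

# Hypothetical weights and predictor values.
a = additive_score(1, [2, 3], [4, 5])    # 1 + 2*4 + 3*5
g = geometric_score([1, 1], [4, 9])      # square root of 4*9
```

Because the logarithm of a weighted product is a weighted sum of logarithms, the weights of the multiplicative form can be estimated with the same linear regression methods after taking logs of the predictors and the criterion, provided all values are positive.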

3.2 Comparing alternative formulas. Assuming that the
first stage (3.1) yields several formulas, all of which
exactly predict the improved Miller-Coleman scores, they will
need to be compared to determine which one should be selected
as the RRI formula. To make this comparison, new passages
will be added to enlarge the original criterion scale, and
the formulas that predict the expanded scale most accurately
will be identified.

^ At the time of writing (July, 1973), we have completed the
first few steps towards building the formula. The 36 Miller-
Coleman passages have been keypunched, and the resulting deck
of some 500 cards has been thoroughly checked. A program has
been written which can read the cards, separate the entries
into individual words, and perform a variety of counting and
averaging operations. Thus, as soon as the improved Miller-
Coleman scores have been calculated, RRI will be in a position
to proceed to the actual construction of the formula.

a. An expanded criterion. There are two reasons
for evaluating the various formulas in terms of their ability
to predict an expanded criterion scale. First, the 36 Miller-
Coleman passages may not adequately represent the range or the
intervals of difficulty of adult-level English text. Second,
the observed scale values of the variables in the 36 passages
may not be typical of a larger random sample of English text,
i.e., the language in the 36 passages may, for some reason,
be atypical. The addition of more data points increases the
chances that the criterion will be truly representative of
the readability of English text. Hence a formula that can 
predict such a criterion scale should also be able to predict 
the readability of almost any passage of English text selected 
at random. 

To expand the criterion scale, cloze tests will
be prepared for each of a large number of randomly selected
passages of text. These tests will be administered to a
large number of readers^11 to establish stable cloze scores^12
for the passages. Once scaled for readability, these new
passages will be added to the 36 Miller-Coleman passages to
create an expanded criterion scale. In enlarging the crite-
rion scale, RRI will include pairs of passages with duplicate
cloze scores and single passages that duplicate the cloze
scores of passages on the original scale, in order that for-
mula reliability can be studied (see below).

^11 Highly competent readers should be used to ensure that an
adequate spread in the cloze scores of the most difficult
passages is obtained.

^12 Since as many as 200 passages may be included in the
expanded criterion scale, it would be unreasonable to expect
any one subject to take a cloze test for every passage. There-
fore, statistical designs that allow for some set of n passages
to be assigned to each subject will be required.

b. Comparing the formulas. The ability of the
various formulas to predict accurately and reliably the cloze
scores of the expanded criterion scale will be compared.
Since passages ranging from very easy to very difficult will
be scaled in building the reading effectiveness measure, the
ability of formulas to predict accurately over the entire
readability range will be compared. Since the reading effec-
tiveness test to be taken by any one student will cover a
relatively small segment of the readability scale, the ability
of various formulas to predict accurately within a narrow
range of difficulty, i.e., the ability of the formulas to
discriminate between passages whose readability difference is
small, will also be compared. Finally, since there will be
passages on the expanded scale that have duplicate cloze
scores, it will be possible to compare the reliability of the
various formulas, i.e., the ability of the formulas to assign
matching readability scores to passages that have matching
cloze scores. After analyzing the results of all these com-
parisons, unsatisfactory formulas will be discarded.
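
The three comparisons described above (accuracy over the whole range, accuracy within a narrow band, and agreement on passages with duplicate cloze scores) might be computed along the following lines. The choice of root-mean-square error as the accuracy statistic, the treatment of "duplicate" as exactly equal observed scores, the function names, and all of the numbers are assumptions made for illustration.

```python
from math import sqrt

def rmse(predicted, observed):
    """Root-mean-square error of predicted against observed cloze scores."""
    return sqrt(sum((p - o) ** 2 for p, o in zip(predicted, observed))
                / len(observed))

def rmse_in_band(predicted, observed, low, high):
    """RMSE restricted to passages whose observed score lies in [low, high].

    Assumes at least one passage falls inside the band.
    """
    pairs = [(p, o) for p, o in zip(predicted, observed) if low <= o <= high]
    return rmse(*zip(*pairs))

def duplicate_disagreement(predicted, observed):
    """Mean gap between predictions for passages sharing a cloze score."""
    groups = {}
    for p, o in zip(predicted, observed):
        groups.setdefault(o, []).append(p)
    gaps = [max(g) - min(g) for g in groups.values() if len(g) > 1]
    return sum(gaps) / len(gaps)

# Hypothetical scores for four passages; the first two are duplicates.
observed = [0.2, 0.2, 0.5, 0.8]
predicted = [0.25, 0.15, 0.5, 0.7]

overall = rmse(predicted, observed)
narrow = rmse_in_band(predicted, observed, 0.1, 0.3)
reliability_gap = duplicate_disagreement(predicted, observed)
```

A formula would be preferred, other things being equal, when all three quantities are small.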

3.3 Conducting trade-off analyses. Even after unsatis-
factory formulas have been discarded, it is quite possible
that several formulas will remain, all of which predict the 
expanded criterion scores with perfect or near perfect accu- 
racy. Up until this point in the development of the formula, 
issues of cost have not been actively considered (apart from 
the imposition of constraints in the selection of predictor 
variables, discussed earlier). To decide among these alter- 
native satisfactory formulas, trade-off analyses will be 
carried out in which the power of each formula is evaluated 
against the computer costs associated with using it in cal- 
culations. These trade-off analyses should enable the most 
cost-effective formula to be identified. 

In carrying out these analyses, particular interest
will be focussed on examining the trade-offs associated with
choosing a formula that has a large number of variables,
especially if any of those variables are costly to calculate.^13
In principle, the predictive power of any formula
should be increased by incorporating into it more and more
variables that correlate with the criterion. However, previous
research in readability (e.g., Bormuth, 1966; Farr, Jenkins &
Paterson, 1951) suggests that after a few of the most pertinent
variables have been used to compose a regression formula, the
additional contribution to r² of more variables becomes very
small. Therefore, in selecting the final formula, the con-
tribution of each variable will be weighed carefully against
its costs.
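
One simple way of weighing each variable's contribution against its cost is sketched below. The decision rule (stop adding variables once the gain in r² per unit of cost falls below a threshold), the function name, the variable names, and every number are hypothetical; they are meant only to show the shape of the calculation, not results that have been obtained.

```python
def select_variables(candidates, min_gain_per_cost):
    """Keep adding variables while the gain in r-squared justifies the cost.

    candidates: (name, cumulative r-squared after adding it, relative cost),
                listed in the order the variables would enter the formula.
    Returns the retained variable names and the r-squared they achieve.
    """
    chosen, previous_r2 = [], 0.0
    for name, r2, cost in candidates:
        gain = r2 - previous_r2
        if gain / cost < min_gain_per_cost:
            break
        chosen.append(name)
        previous_r2 = r2
    return chosen, previous_r2

# Invented figures: two cheap counts, then a costly table look-up.
candidates = [
    ("letters_per_word", 0.70, 1.0),
    ("words_per_sentence", 0.80, 1.0),
    ("word_frequency", 0.83, 10.0),
]
chosen, r2 = select_variables(candidates, min_gain_per_cost=0.01)
```

In this invented case the third variable adds 0.03 to r² at ten times the cost of the others and is therefore not retained.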

3.4 Verifying validity. Finally, after the most cost-
effective formula has been selected, its ability to predict
accurately the cloze scores of an entirely new set of passages 
must be verified. Since the formula will have satisfied rigor- 
ous criteria during the selection process, its ability to 
predict scores of new passages is reasonably assured. How- 
ever, classic measurement theory requires cross-validation, 
hence RRI proposes to carry it out. 

In order to carry out these trade-off analyses, it may be 
necessary to solve some computer systems problems. Probably 
the largest problem will concern the practical use of word 
frequencies as a measure of vocabulary difficulty. To make 
use of word frequencies, ways will need to be found to reduce 
the computer time and costs currently associated with table 
look-up operations.

To carry out the cross-validation, a large number
of subjects would be asked to take cloze tests over a new
set of passages. To check the validity (i.e., the predictive
accuracy) of the formula, cloze scores predicted by the for- 
mula for these new passages would be compared with the scores 
actually earned by the passages. 
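
The comparison of predicted with earned cloze scores might be
summarized as in the following sketch. The scores are invented for
illustration, and the two summary statistics (a product-moment
correlation and a mean absolute error) are assumptions about how
the comparison could be reported, not a procedure specified here.

```python
import math


def pearson_r(xs, ys):
    """Product-moment correlation between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def mean_absolute_error(predicted, observed):
    """Average size of the prediction error, in cloze-score points."""
    return sum(abs(p - o) for p, o in zip(predicted, observed)) / len(predicted)


# Hypothetical cloze scores (percent correct) for five new passages.
predicted = [62.0, 48.0, 35.0, 71.0, 54.0]
observed = [60.0, 50.0, 33.0, 74.0, 52.0]

r = pearson_r(predicted, observed)
mae = mean_absolute_error(predicted, observed)
print(round(r, 3), round(mae, 1))
```

The error measure is reported alongside the correlation because a
high correlation alone does not show that predicted and earned
scores agree in numerical value.
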

Chapter IV
Implementing the Design Concepts in the
Construction of Reading Tests

The development of the readability formula and word-type
distributions discussed in the preceding chapters will make
possible the construction of tests to assess students'
knowledge of words in various frequency of occurrence bands and
students' ability to comprehend passages of text of various
levels of difficulty. In this chapter, the procedures for
building such tests will be outlined. The discussion will not
cover the details of test construction, since the task of
developing a "test blueprint" did not fall within the contract
period covered by this report; rather, the discussion covers
the major design concepts for building tests of reading
achievement, noting differences between these concepts and
current practices in norm-referenced test construction.

A. The Specification of a Corpus of English Text

The first step in the test construction plan is the
specification of a corpus that can be used both to compute the
frequencies of words that occur in adult English text and to
scale the readability of adult reading materials. The accuracy 
of the readability and word frequency data will depend on the 
extent to which materials selected for the corpus are repre- 
sentative of the universe of written English that adults 
encounter. To be truly representative of this universe, the 
materials that make up the corpus must span the range of diffi- 
culty found in English text and must include all types of 
reading materials used by the adult population, in proportion 
to their extent of use. 

To insure the representativeness of the corpus, two 
requirements must be met. First, because the universe of 
written English is very large, a domain (or subset) of materials 
must be identified that is comprehensive and that is amenable 
to systematic and objective sampling. Second, the domain must 
be systematically and objectively sampled to form the corpus.
Thus, in a manner of speaking, the corpus emerges as a repre- 
sentative sample of a representative sample. 

The need for objectivity and comprehensiveness in the 
construction of a corpus led RRI to examine the feasibility 
of using the contents of the Library of Congress as a domain 
representative of the universe of written English. The Library
of Congress is the largest general collection of adult reading 
materials available in this country. If text from each cate- 
gory in the Library's classification system were sampled in 
proportion to the size of the Library's holdings in that cate- 
gory, a representative corpus of English text should be obtained. 

However, a preliminary examination of the classification 
system used by the Library revealed that the system is probably 
not amenable to systematic statistical sampling. The highly 
complex classification system is enumerative rather than ana- 
lytic, making it hard to determine what constitutes a category 
of materials. Moreover, the classification scheme differs 
somewhat in each major division of the Library (presumably to 
meet the particular needs of each division) and is constantly 
being extended. Consequently, the size of the holdings in 
different categories would be difficult to determine. When 
these problems were uncovered, it was decided that, because of
the classification system used, the Library's collection should 
not be used to obtain a representative sample of materials. 

A simpler and more satisfactory procedure for obtaining 
a representative corpus of adult materials has been devised. 
This procedure is based on the identification of the domain 
of all periodicals as representative of the universe of written 
English and as an appropriate domain from which the RRI corpus 
can be built through systematic and objective sampling. This
domain includes newspapers, magazines, journals, and other 
materials published at regular intervals. 

Periodicals should reflect the full spectrum of activities 
and concerns of society, and therefore should be representative 
of adult reading materials with respect to content. Their text 
should vary widely enough in difficulty to be representative
of the readability of written English. The availability of
such reference works as the Readers' Guide to Periodical
Literature makes it possible to employ unbiased and systematic
sampling procedures, such as the use of random numbers to
designate which pages and which entries per page should be
sampled. Furthermore, since circulation figures are available
for periodicals, text from different periodicals can be sampled
in proportion to the size of their circulations. Thus, the domain
of periodicals is amenable to systematic sampling procedures 
to define a representative, adult corpus. 
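
A minimal sketch of such a sampling procedure follows. The
periodical titles, circulation figures, and page count are
hypothetical, and the use of a seeded pseudo-random generator in
place of a printed table of random numbers is an assumption made
for the example.

```python
import random

# Hypothetical periodicals and circulation figures, for illustration only.
CIRCULATION = {
    "Daily newspaper A": 800_000,
    "Weekly magazine B": 150_000,
    "Trade journal C": 40_000,
    "Quarterly review D": 10_000,
}


def draw_sample(circulation, n_passages, pages_per_issue, seed):
    """Pick (periodical, page) pairs, weighting periodicals by circulation."""
    rng = random.Random(seed)  # fixed seed so the draw can be reproduced
    titles = list(circulation)
    weights = [circulation[title] for title in titles]
    sample = []
    for _ in range(n_passages):
        title = rng.choices(titles, weights=weights, k=1)[0]
        page = rng.randint(1, pages_per_issue)  # random page within an issue
        sample.append((title, page))
    return sample


sample = draw_sample(CIRCULATION, n_passages=1000, pages_per_issue=40, seed=1)
share_a = sum(1 for title, _ in sample if title == "Daily newspaper A") / len(sample)
print(round(share_a, 2))
```

Because the assumed newspaper accounts for 80 percent of total
circulation, about that share of the sampled passages should come
from it.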

Practical considerations also dictate the use of periodi- 
cals as a domain from which the corpus will be defined. 
Assuming that a representative sample is drawn from current 
issues, the corpus can be assembled easily and inexpensively.
There should be no problem locating the materials, and no 
likelihood of encountering archaic language. 

Thus, periodicals appear to be a promising domain from
which an adult corpus could be formed. However, there are
two possible objections that could be raised to the use of
periodicals, both of which are amenable to empirical resolution.
One objection is that periodicals do not cover as broad a range
of subject areas as books. The validity of this objection can
be tested by randomly sampling books from a reasonably
comprehensive collection (e.g., a large library) and determining
the extent to which the subject matter of the sampled books is or
is not contained in periodicals. The second objection is that
the vocabulary and readability of materials within a subject
area may differ systematically between books and periodicals. 
This objection can be tested by drawing random samples from 
books and periodicals in any field of study, and comparing 
them. 

B. Calculating the Readability Values of Text and the
Familiarity of Word Types 

When the adult corpus has been identified, assembled, 
and entered into a computer, the readability of and word type 
distributions in the materials that make up the corpus can be 
calculated. The readability of materials in the corpus will 
be determined using the formula described in Chapter III. It
should not be necessary to analyze the materials in the corpus 
in their entirety to calculate their readability values. 
Rather, it should be possible to base readability data on 
sample passages from text. Further study will be required to 
determine the number and length of passages that must be
analyzed per periodical to achieve various levels of reliability
in the readability estimates. Since longer passages and larger 
numbers of passages usually give more stable estimates of 
readability than shorter and fewer ones, the levels of relia- 
bility desired need to be weighed against the costs of 
increasing the size of the sample. These sampling decisions 
must be made before the formal analysis and scaling begin. 

To determine word type distributions, every word in the 
materials selected for the corpus will be tabulated. For this 
purpose, a series of lemmatization rules is required, defining
when words are to be counted as same or different. As noted
in Chapter II, the lemmatization problem has been attacked by
other investigators, and it is possible that adequate
procedures already exist for carrying out word-type frequency
counts. If existing lemmatization procedures can be applied
to the analyses of materials selected for the RRI corpus,
many months of work will be saved.
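
The tabulation step can be sketched as follows. The lemmatization
rules shown (a short list of irregular forms and four suffixes) are
toy rules assumed for the example; they are far too crude for a
real count and are not the rules RRI would adopt.

```python
import re
from collections import Counter

# Toy lemmatization rules, assumed purely for illustration.
IRREGULAR = {"went": "go", "children": "child", "was": "be", "were": "be"}
SUFFIXES = ["ing", "ed", "es", "s"]


def lemmatize(word):
    """Map an inflected form to a single word type under the toy rules."""
    word = word.lower()
    if word in IRREGULAR:
        return IRREGULAR[word]
    for suffix in SUFFIXES:
        # Require at least three letters to remain after stripping.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def word_type_counts(text):
    """Tabulate every word in the text by its lemmatized word type."""
    tokens = re.findall(r"[A-Za-z']+", text)
    return Counter(lemmatize(token) for token in tokens)


counts = word_type_counts("The children walked. A child walks. They were walking.")
print(counts["walk"], counts["child"], counts["be"])
```

Here "walked," "walks," and "walking" are counted as one word
type, which is the kind of decision the lemmatization rules must
settle in advance.
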

C. Trade-Off Analyses: Precision vs. Complexity of
Information 

Before tests can actually be built, decisions must be 
made concerning the range of readability values and the number 
of word familiarity bands to be measured. These decisions 
cannot be made arbitrarily; rather, they depend on the outcome 
of trade-off analyses that weigh the costs and benefits asso- 
ciated with attaining precise measurements against those asso- 
ciated with attaining complex information from the reading 
test. Test construction is constrained by the fact that 
increasing either the precision or complexity of information 
obtained from the test raises testing time and costs. When 
time or costs are fixed, increases in precision can be gained 
only at the expense of complexity, and vice-versa. The need 
for precision in measurement must therefore be carefully 
examined in relation to the need for complex information. 

To be useful in assessing growth in reading achievement, 
test scores must be replicable within small error, i.e., scores 
must have high reliability. The need for a reliable measure 
in detecting changes can be illustrated by a simple example. 
Suppose the weight of a person before and after a two-week 
diet is to be compared to see if he has lost weight. If the 
scale used is reliable only within five pounds, an observed 
difference of two pounds cannot be regarded with confidence 
as a real loss in weight. Because of the relatively low
reliability of the measuring instrument (the scale, in this 
case) any two weighings could yield as great a weight differ- 
ence as that observed over the two-week time interval. Thus, 
the fact that the scale is only reliable within five pounds 
defeats the purpose of the measurement task at hand, which 
requires the detection of smaller differences. This scale 
may be perfectly adequate for many purposes, but not for taking 
measurements where differences of less than five pounds need 
to be detected reliably. 

Similarly, if scores on a reading test are not reliable 
within a sufficiently narrow range relative to the change to 
be detected, the measurement error may be too large to permit 
a firm conclusion that an observed difference represents a 
true, rather than a chance, change in test score. 

If only gross changes are of interest, such as differences 
between the reading skill of a first grader and a twelfth 
grader, a test which detects large changes reliably would be 
adequate to the task. But since it is essential to detect 
growth over much shorter time periods, e.g., from the beginning 
to the end of a school year, the test must measure small 
changes reliably to allow differences of this small magnitude 
to be detected with confidence. The capability of measuring 
small changes reliably requires, by definition, a high
precision of measurement.

In order to detect small differences, it is necessary to
minimize the chance variability in test scores. Some varia- 
bility in test scores obtained at different times almost 
always occurs. Part of this variability may be attributable 
to real changes in the skill being measured, i.e., the subject 
has become more capable during the time interval between
measurements. However, there are factors other than a true 
change in skill that can account for differences in test per- 
formance. To increase the likelihood of detecting a true 
change in performance, these other sources of score variability 
must be minimized. 

The principal source of chance score variability is the 
test itself. Each item on a test represents a single observa- 
tion of behavior. Relative to the total number of observations 
that could be made (i.e., the universe of test items that could 
be written), the number of observations actually made on any
test is small. Whenever few observations are made, there is
the chance that unusual or atypical instances of behavior will
be observed and, because the total number of observations is
small, will have a marked effect on the total score. For
example, one set of items can turn out to be much easier for
a particular student than another set because, by chance, the
first set contains a few items whose subject matter is
particularly well known to the student.¹

The most common way to reduce variability in scores stem- 
ming from item sampling peculiarities is to increase the number 
of items. Longer tests yield more stable estimates of
performance because they are based on a greater number of
observations and, therefore, represent a more adequate sample of
performance. As more items are sampled, the chance factors
associated with individual items have less influence on the
total score, and the total score is therefore more stable.
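
The effect of test length on score stability can be illustrated
with a small simulation. It rests on a simplifying assumption not
made in the text: every item is answered correctly with the same
fixed probability, so that all variation between repeated tests
comes from item sampling. The ability value, test lengths, and
seed are arbitrary.

```python
import random
import statistics


def simulate_scores(true_ability, n_items, n_repeats, seed):
    """Proportion-correct scores from repeated tests of the same length.

    Each item is answered correctly with probability `true_ability`, so
    all the variation between repeats is due to item sampling alone.
    """
    rng = random.Random(seed)
    scores = []
    for _ in range(n_repeats):
        correct = sum(1 for _ in range(n_items) if rng.random() < true_ability)
        scores.append(correct / n_items)
    return scores


for n_items in (10, 40, 160):
    scores = simulate_scores(true_ability=0.7, n_items=n_items, n_repeats=2000, seed=7)
    print(n_items, round(statistics.pstdev(scores), 3))
```

Under this assumption the spread of scores shrinks roughly with
the square root of the number of items, so quadrupling test length
about halves the chance variability.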

While in theory there is no barrier to increasing test
length (the more observations that are made, the better), test
length in practice is constrained by several factors, especially
by the available testing time and the hourly cost of keeping
a student in school. With time and budget limited, the use of
as many items as possible to increase accuracy in measuring
one skill often conflicts with the desire to use the test to
measure more than one skill, i.e., to obtain complex
information from the test. Unfortunately, with limited available
time, increases in the precision of measurement can be obtained
only at the expense of loss in the complexity of information.
This poses a real dilemma, since the need for accuracy may
make it impossible to measure all the skills about which
information is desired.

¹ Item sampling factors are not the only source of score
variability. Temporary conditions unrelated to the skills being
measured can also affect scores. For example, the subject may
feel ill at one testing time, and hence not perform as well as
he otherwise might; the room may be overly stuffy at one
testing session; and so on. The effects of such sources of
unreliability are present to some extent in all cognitive test
scores. While these sources of unreliability present real
problems in measurement, their control is properly the concern
of test administrators and test interpreters, rather than the
concern of test constructors.

Let us assume that a total of 30 minutes is available for 
testing vocabulary. If all 30 minutes are devoted to testing 
students' knowledge of words in one frequency of occurrence 
band, a precise measure of students' word knowledge in that 
frequency band probably could be obtained. However, no infor- 
mation would be obtained about their word knowledge in other 
frequency bands. On the other hand, if all frequency bands 
are tested, the number of observations made in each band may 
be too small to yield sufficiently precise and reliable data 
for detecting growth. 
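
The arithmetic behind this example can be sketched as follows.
The pace of two items per minute, the 60 percent success rate, and
the use of the binomial standard error as the index of precision
are all assumptions introduced for illustration.

```python
import math


def standard_error(p_correct, n_items):
    """Binomial standard error of a proportion-correct score."""
    return math.sqrt(p_correct * (1 - p_correct) / n_items)


# Assumptions for illustration: 30 minutes, 2 items per minute, and a
# student who knows about 60 percent of the words in every band.
TOTAL_ITEMS = 30 * 2
for n_bands in (1, 3, 6):
    items_per_band = TOTAL_ITEMS // n_bands
    se = standard_error(0.6, items_per_band)
    print(n_bands, items_per_band, round(se, 3))
```

Spreading the same 60 items over six bands leaves 10 items per
band, and the standard error within each band is then about two
and a half times what it would be if all items fell in one band.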

The level of precision required in a test will depend on 
whether group growth or individual growth is to be detected. 
Group scores are more stable than individual scores because 
pooling data over N individuals results in a degree of
precision in the group score equivalent to having N times the number
of observations on a single individual. Consequently, a test 
designed to detect group growth can employ fewer test items 
to measure any single skill to the required degree of
precision, and hence can measure a larger number of skills during
a testing session of given length, than a test designed to 
detect individual growth. 
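
The pooling argument can be written out under a simplifying
assumption that is not stated in the text: each of the k
observations on an individual carries independent chance error of
variance \sigma_e^2, and the N individuals are treated as a fixed
group. The symbols k and \sigma_e^2 are introduced here only for
the illustration.

```latex
\operatorname{Var}\left(\bar{x}_{\text{individual}}\right) = \frac{\sigma_e^{2}}{k},
\qquad
\operatorname{Var}\left(\bar{x}_{\text{group}}\right)
  = \frac{1}{N^{2}} \sum_{i=1}^{N} \frac{\sigma_e^{2}}{k}
  = \frac{\sigma_e^{2}}{Nk}.
```

On these assumptions the chance error in the group mean equals
that of a single individual's score based on Nk observations,
which is the sense in which pooling over N individuals multiplies
the number of observations by N.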

Another element in the compromise between precision of 
measurement and the number of different skills measured (com- 
plexity of information) depends on whether the test will be 
used to determine if a student has attained a specified minimum 
standard of reading competence, or whether it will be used to 
determine the student's precise level of reading ability. 
Less testing time is required to determine whether or not a 
student has reached some minimum standard of reading competence 
than is required to specify his level of reading ability. 

To determine whether a student has met a minimum standard
of competence, precision is required only at one point, X,
on the scale (i.e., is his ability equal to or greater than X),
whereas to determine a student's level of reading ability,
precision is required at two points, X and Y, on the scale
(i.e., is his ability equal to or greater than X and equal to
or less than Y). Therefore, if it is decided to test only
for attainment of minimum standards, the lower requirements
for testing time should make it possible to measure a larger
number of skills than if the student's precise level of
achievement needs to be determined.

An important element that will be required in the
trade-off analyses is information concerning the quantitative
relation between test length and test reliability for tests of
the type discussed in this report. While it is generally true that
longer tests (in which many observations are made) are more
reliable than shorter ones, it is important to note that the
well-established procedures for calculating the reliability of
norm-referenced tests, and for estimating the effects of
increased test length on the reliability of norm-referenced
tests, are not applicable
to the proposed RRI reading effectiveness measure.

The inapplicability of norm-referenced test methodology 
to determining the reliability of the RRI tests is not 
surprising, in view of the different purposes of these types 
of tests and the different concepts of reliability that follow 
from these different purposes. The purpose of a norm-referenced 
test is to discriminate among different persons' performances 
on a particular set of test items, while the purpose of the 
RRI test is to measure different persons' performances 
relative to some set of standards. Thus the concept of 
reliability for norm-referenced tests is framed in terms of 
stable discrimination, i.e., reproducing (e.g., in two or more
test administrations) the same rank-ordering among the members
of a group of subjects irrespective of the numerical values 
of their scores, while the concept of reliability for the RRI 
test must be framed in terms of stable performance relative to 
the standards, i.e., reproducing (e.g., in two or more test 
administrations) the same numerical score for each subject 
taking the test. 

Norm-referenced tests calculate reliability by correlating
two sets of scores (e.g., from two administrations of different
versions of the test) for each of N persons. Since correlation
is mostly sensitive to the ordinal relationships among the
scores in the two sets, reliability in norm-referenced tests
is largely determined by the replicability of the relative size
of scores, regardless of how large or small the numerical
values of those scores may be. As long as a test places
subjects in the same high-to-low order with repeated
measurements, reliability as measured by any correlation
coefficient will be high.²

In the RRI reading effectiveness measure, however,
reliability should be defined and calculated in terms of the
degree to which the numerical values of test scores are
replicable. A test with a "norm-referenced" reliability coefficient
of r = .99 would be unreliable for RRI's purposes if all
scores on one test were uniformly higher or lower than those
on another. Therefore, when the mathematical properties of
the reading effectiveness scale have been determined, new
measures of test reliability appropriate to the purpose of the
RRI reading test need to be developed.³

² Note that r = .975 when X and X² are correlated for the
integers 1 to 20 (i.e., when the sets of correlated numbers
are 1 and 1, 2 and 4, 3 and 9, 4 and 16, ... 20 and 400).
Thus r is very high even though the values of the numbers
within each pair of scores differ considerably from each other.

³ Until an empirical attempt is made to construct and use the
proposed reading effectiveness measure, we will not have
sufficient information to define an arbitrary zero. In addition,
until data are available, determination of whether the scale
formed by reading effectiveness scores is continuous over all
intervals, or whether it meets the requirements of an interval
scale cannot be made. These and other issues need to be
investigated before a meaningful measure of reliability can be
formulated. For these reasons, a formal discussion of the
issue of overall test reliability has been deferred in favor
of concentrating on identifying and reducing those factors
which contribute noise to measurement.
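
The contrast drawn in this section between rank-order reliability
and the replicability of numerical scores can be checked with a
short calculation. The first part correlates the integers 1 to 20
with their squares; a direct product-moment calculation gives a
coefficient of about .97. The second part uses invented retest
scores, each exactly 10 points higher than the first score, to
show a correlation of essentially 1 alongside a uniform shift.

```python
import math


def pearson_r(xs, ys):
    """Product-moment correlation between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# The integers 1 to 20 paired with their squares.
x = list(range(1, 21))
squares = [value * value for value in x]
r_squares = pearson_r(x, squares)

# Hypothetical retest in which every score rises by exactly 10 points.
first = [42, 55, 61, 70, 88]
second = [score + 10 for score in first]
r_shift = pearson_r(first, second)
mean_shift = sum(b - a for a, b in zip(first, second)) / len(first)

print(round(r_squares, 3), round(r_shift, 3), mean_shift)
```

In both cases the correlation is high although the paired values
disagree, which is why a measure based on numerical agreement
would be needed for the RRI test.
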

D. A Plan for the Efficient Use of Testing Time

Whatever decisions emerge from the trade-off analyses 
concerning the test's level of precision and complexity of 
information, testing time should be used as efficiently as
possible. One way to increase the precision of measurement
obtainable within a fixed period of time is to use a branched
testing strategy (Cronbach, 1970). The principle of branched
testing is to locate rapidly the student's approximate level
of achievement on the skills being tested, and then to assign
to each student a test whose items are concentrated around 
that level. Location of the student's approximate achievement 
level can be accomplished either by first giving a short, 
broad-spectrum test or by using information concerning per- 
formance on a previous test. By concentrating testing right 
around the level of the student's achievement, the number of 
relevant observations is increased, and it is possible to 
obtain a more precise and reliable estimate of his current 
reading skill level. 
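
A two-stage version of this strategy is sketched below. The item
bank, the 1-to-10 difficulty scale, and the rule of taking the
hardest routing item passed as the student's approximate level are
all hypothetical choices made for the example.

```python
# Hypothetical item bank: item name -> difficulty on an arbitrary 1-10 scale.
ITEM_BANK = {f"item{d}{suffix}": d for d in range(1, 11) for suffix in "abc"}


def locate_level(routing_results):
    """Estimate level as the hardest routing item answered correctly.

    `routing_results` maps a routing item's difficulty to True or False.
    """
    passed = [difficulty for difficulty, correct in routing_results.items() if correct]
    return max(passed) if passed else min(routing_results)


def assign_items(item_bank, level, n_items):
    """Choose the items whose difficulty lies closest to the located level."""
    ranked = sorted(item_bank, key=lambda name: (abs(item_bank[name] - level), name))
    return ranked[:n_items]


# Results of a short, broad-spectrum routing test (invented).
routing = {2: True, 4: True, 6: True, 8: False, 10: False}
level = locate_level(routing)
second_stage = assign_items(ITEM_BANK, level, n_items=9)
print(level, sorted(ITEM_BANK[name] for name in second_stage))
```

All nine second-stage items fall within one difficulty step of the
located level, so none of the testing time is spent on items that
are much too easy or much too hard for the student.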

In measuring paragraph comprehension, the advantage of 
using a branched testing strategy is clear, since little 
information can be gained by using testing time to have stu- 
dents read passages that are much too easy or much too difficult 
for them. 

In measuring vocabulary, the application of a branched
testing strategy is less obvious. Ideally we would like to
be able to measure students' word knowledge in several
frequency bands. However, a certain minimum level of reliability
for word knowledge scores in any one band must be obtained, or 
it will be impossible to detect growth in that band. There- 
fore, it may be impossible to test in as many frequency bands 
as might be desired. 

Branched testing strategy would call for concentrating 
testing in the band or bands where it is most important to 
detect growth. This band is likely to change as students get 
older. For example, with younger students it may be most
important to detect growth in knowledge of common words,
whereas with older students it may be most important to detect
growth in knowledge of moderately rare words. Some empirical evidence
will be needed concerning children's knowledge of words in 
various frequency bands at different ages before deciding how 
best to implement a branched strategy in testing vocabulary. 

E. A Plan for the Construction of Non-Biased Tests

In order to provide unbiased estimates of students'
reading achievement, the tests should meet certain criteria of
fairness. First, item content should be unbiased with respect
to programs. All students, regardless of the program of
instruction they have received, should be equally prepared
for the test items. Thus, there must be sufficient overlap
between programs in the readability of materials used and 
in their vocabulary content to yield a set of words and a 
range of readability on which all students can be tested. If 
such overlap does not exist, the tests would be biased against 
students in some programs. 

Evidence suggests that in the elementary grades there is 
relatively little overlap between major reading textbook series 
with respect to the grades in which particular vocabulary words 
are introduced. Stauffer (1966) analyzed the vocabulary in- 
troduced in each of the first three grades in seven basal 
reader series. He found (Table 2) that the number of new words 
common to all series in each grade was remarkably small. 
Analyses are therefore required to determine the earliest grade 
at which the major reading programs are sufficiently similar 
to one another in the readability and vocabulary of their 
instructional materials to make it feasible to build tests that 
will be unbiased with respect to programs. 

Table 2

Vocabulary Words Introduced in
Seven Basal Reader Seriesᵃ

          Total number of new    Number of new words
Grade     words introduced       common to all series

1                  570                   117
2                1,289                    13
3                2,155                     7

Total            4,014                   137

ᵃ Based on Stauffer, 1966.
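
The degree of overlap shown in Table 2 can be expressed as a
percentage, and the kind of comparison needed for later grades can
be sketched with set operations. The counts in the first part are
taken from Table 2; the three word lists in the second part are
invented and stand in for the actual series vocabularies, which
are not reproduced here.

```python
# Counts from Table 2: grade -> (new words introduced, new words common to all series).
TABLE_2 = {1: (570, 117), 2: (1289, 13), 3: (2155, 7)}


def percent_common(new_words, common):
    """Share of newly introduced words that every series has in common."""
    return 100.0 * common / new_words


for grade, (new_words, common) in TABLE_2.items():
    print(grade, round(percent_common(new_words, common), 1))

# Hypothetical word lists for three series, for illustration only.
SERIES = {
    "Series A": {"look", "come", "jump", "red", "ball"},
    "Series B": {"look", "come", "run", "red", "dog"},
    "Series C": {"look", "come", "play", "blue", "ball"},
}


def words_common_to_all(series):
    """Words introduced by every series."""
    return set.intersection(*series.values())


def words_in_any(series):
    """Words introduced by at least one series."""
    return set.union(*series.values())


print(sorted(words_common_to_all(SERIES)), len(words_in_any(SERIES)))
```

A test restricted to the words common to all series would be
unbiased with respect to programs, but in the early grades that
common set is evidently very small.
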

Second, test items should not penalize some groups of
children or give unfair advantage to others (as has been 
charged of some norm-referenced tests) by including content 
that is likely to be more familiar to some groups than others 
because of ethnic background, social class, or place of 
residence. 

Third, all test items should be as objective as possible. 
While standardized, norm-referenced tests are called "objec- 
tive," the term applies only to scoring procedures. It does 
not apply to item construction, which is subjective in both 
the selection of questions and the generation of response 
options. In the proposed reading effectiveness measure, the 
goal should be objective item construction as well as objective 
scoring procedures. 

Fourth, reading and comprehension of the passages should 
be both necessary and sufficient to answer the test questions
correctly. In some standardized tests, it is possible to
answer certain items correctly without having read a test 
passage, because the item deals with general information that 
the student may have, independent of the material provided. 
In others, it is possible to miss certain items, even after 
having read and understood the passage, because the answer does 
not appear in the material provided. It is inappropriate to 
draw any conclusions concerning a student's reading skill 
unless we can be certain both that he has read the test
passage and that the passage provides all the information needed
to answer the questions asked. 
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F. A Proposed Item Format for Passage Comprehension 



With the above considerations in mind, it is proposed 

that items for the reading comprehension test be written in 

a quasi-cloze format. In this item format, a word is deleted 

from text and replaced with a blank of standard length. The 

student must choose the deleted word from among several options 

provided following the end of the passage. The rationale for 

this item format is similar to that governing the use of the 

true cloze procedure as a measure of comprehension: the 

better a student understands what he is reading, the better 
*** 

k 

^ The following additional item formats may be used if a 
procedure can be worked out for in'^erring from a student's 
responses to a set of questions ir-iasuring discreet pieces of 
information that he comprehends an entire passage. 

Identification of missing facts or ideas . In this item for- 
mat the student is asked to identify which of several ideas, 
people, problems, etc. is not mentioned in the passage. The 
student selects his answer from among several options, all but 
one of which have been mentioned in the passage. This item 
type is designed to measure students' understanding of the 
facts presented in a passage. Wording i3 altered between text 
and response options so that simple word matching will not 
yield the correct answer. 

Vocabulary meaning in context . Several standardized tests 
purport to measure word meanings in context, but examination 
of the actual items shows that often they are ordinary vocabu- 
lary items, with context having . little or no effect on word 
meaning. The item type proposed here is one in which the stu- 
dent must understand the passage to select the correct meaning 
of the word in context, since all response options will be 
genuine meanings of the word being tested. The correct 
response will alternate randomly between the dominant and 
secondary meanings of a word. Since most common words have
several meanings, it should not be difficult to construct such
items with the help of a good unabridged dictionary.

Questions about facts in passages. Bormuth (1968) has suggested
that the item writing process can be made objective by con-
structing questions that are interrogative grammatical trans-
formations of the syntax of sentences appearing in the passage.
To make items of this type, a word, phrase, or clause is
deleted from a sentence and is replaced by a question marker,
transforming the sentence into a question for which the correct
answer is the element that was deleted. For example, when
various transformations are applied to the sentence "John rode
the horse at the farm," the following questions result: "Who
rode the horse at the farm?", "What did John ride at the
farm?", "Where did John ride the horse?", etc. This procedure
permits a kind of control of item difficulty across passages,
since the number and form of transformations (hence, questions)
can be specified in advance, and randomly assigned to passages.
By using this procedure, subjective judgments of item writers
concerning the suitability and comparability of questions can
be avoided. This seems to be a simple and direct way to
determine whether the reader understands facts that are explic-
itly stated in a passage.

he should be able to guess a word that has been deleted from
text. Items of this type are used in several major stan-
dardized tests.^

While this item type bears some resemblance to items in
a true cloze test, it differs from regular cloze items in a
number of important respects. Most importantly, in the pro-
posed format the subject chooses the answer from among several
response options provided, rather than generating the missing
word himself. In addition, many fewer words are deleted in
the proposed format than in a true cloze test (e.g., five
words per passage of 100 words or longer rather than periodic
deletion of every fifth, seventh, tenth, etc. word).

^ Quasi-cloze items are used in the current editions of the
Gates-MacGinitie Reading Test and the Stanford Reading Achieve-
ment Test.

Earlier (see footnote 6, Chapter III), we noted that cloze
tests have one potentially serious shortcoming as measures of 
comprehension, namely that the redundancy of English facilitates 
correct restoration of words even when the meaning of a passage 
is not understood. To avoid this problem in using quasi-cloze 
items in the reading effectiveness measure, function words 
(articles, conjunctions, etc.) will not be deleted, since 
grammatical knowledge alone can lead to their correct replace- 
ment even when the content of the passage is not understood 
(MacGinitie, 1971). Instead, only content words (nouns, 
adjectives, verbs, adverbs) will be deleted. To reduce further 
the likelihood that the constraints of English will lead to a 
correct response even though the material is not understood, 
all response options within each item will be equated for part 
of speech, word frequency, and plausibility when inserted in 
the deleted space. With these constraints imposed, a correct 
response should occur only when (apart from guessing) the stu- 
dent comprehends what he has read. 
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G. Construct Validity 

There are two principal questions to be answered in estab- 
lishing the construct validity of the proposed reading compre- 
hension test. First, do the test items adequately measure the 
construct "reading comprehension"? Second, do the scoring 
procedures result in reasonable inferences concerning a stu- 
dent's comprehension of what he has read? 

1. Do items measure "reading comprehension"? As de-
scribed above, only one type of question (quasi-cloze) will be
used in the reading comprehension test. It is assumed that 
this type of item taps a general comprehension factor. No 
items are proposed at this time for testing specific compre- 
hension subskills, such as recognizing facts, drawing inferences 
from what is said, getting the main idea, understanding the 
author's purpose, discerning mood, recognizing literary
devices, etc.

The decision not to measure reading subskills is a 
deliberate one made for the following reasons. We saw in an 
earlier section of this chapter that, in a test of fixed 
length, complexity of information can be obtained only at the 
expense of precision of measurement. Therefore, a decision
to measure multiple skills would have to be based on a judg- 
ment that such complexity is worth obtaining, even at the 
expense of precision in the measurement of any one skill.
However, available data make it appear doubtful that the
measurement of many subskills is worthwhile at the cost of 
test precision. 

A review of reading tests by Berg (1973) has un- 
covered more than 70 distinct reading skills that test publishers 
have tried to measure. This probably reflects a widespread 
belief among educators that reading involves multiple skills 
and abilities. However, a number of studies (e.g., Thurstone, 
1946; Harris, 1948; Hunt, 1957; Bormuth, 1969) indicate that 
the separate evaluation of all these factors is not warranted. 
As discussed earlier, the evidence seems to support the exis- 
tence of only two principal factors: knowledge of individual 
words and comprehension of connected text. The data suggest 
that, apart from the apparently distinct word knowledge factor, 
the variance in reading test scores can reasonably well be 
accounted for by a single general comprehension factor. Thus 
the measurement of multiple comprehension subskills does not
appear to be warranted. There are those (e.g., Davis, 1968,
Lennon, 1962) who argue that measurement of a few distinct 
comprehension subskills (e.g., drawing inferences) is justi- 
fied; however, the demonstrated degree of independence of 
these subskills has thus far been relatively modest. 

It should be noted that the lack of convincing 
evidence for the existence of separate comprehension subskills
does not necessarily mean that such subskills do not exist.
It is possible, for example, that investigators' failure to 
obtain distinct comprehension subskills in analyses of test 
data occurred because test items were poorly written, and 
therefore did not adequately measure the skills they were 
supposed to assess. However, in the absence of clear evidence 
of important independent comprehension subskills, RRI believes 
it is preferable to restrict the test to one item type measur- 
ing general comprehension of what has been read. Measurement 
of one general comprehension factor, rather than many sub- 
ordinate ones, should increase the precision of measurement 
possible in the test. Furthermore, the proposed item format,
unlike items required to measure the various reading subskills 
and unlike other item types that tap general understanding of 
a passage, makes it possible to construct items objectively 
and, to some extent, mechanically. 

If the empirical data are correct in suggesting that 
all reading comprehension subskills are highly interrelated, 
then use of one item type that taps a general comprehension 
factor should adequately measure reading comprehension. How- 
ever, it will be necessary at some future time formally to 
establish that the test is valid, i.e., that it measures what 
is theoretically meant by "reading comprehension." One way 
to demonstrate the test's construct validity would be to
follow procedures of the kind used in previous studies investigating
the number of independent factors associated with the
construct "reading comprehension." Essentially, this would 
involve asking questions of the type we have proposed, as well 
as questions of the type suggested by those who believe that 
reading comprehension involves many independent (or nearly 
independent) subskills. Once students' responses to the 
various kinds of questions were collected, it would be a 
relatively straightforward task to determine whether the RRI 
items measure the same factors as are measured by other types 
of items.

2. Is test performance validly interpreted? Performance
on the reading effectiveness measure should be interpreted 
in terms of a subject's ability to comprehend materials 
written at various levels of readability. In order to draw 
such an inference, we must first define the performance that 
will be considered acceptable as an indicator of comprehension. 
We must decide, in other words, how many questions (all, most, 
some, etc.) a student must answer correctly before we credit 
him with understanding the passage on which the questions are 
based. 

a. Using expert opinion to define comprehension 
criterion. One way to determine the criterion of comprehension 
(passing score) for a passage is to ask experts to define it.
This procedure calls for an arbitrary decision as to the
level of performance that will be considered acceptable. Since 
good models of comprehension do not yet exist, there is cur- 
rently no rational or empirical basis for concluding that the 
passing score for a passage should be, say, 90% rather than 
80%. Because the decision is an arbitrary one, no matter where 
the passing score is set there are bound to be those who will 
argue that a different criterion would be more valid. 

The problem of having experts set a criterion 
of comprehension is compounded by the fact that the definition 
of satisfactory comprehension must be a function of the purpose 
for which something is being read. There are some cases (e.g., 
medicine labels) where 100% comprehension is vital, but there 
are many other cases (e.g., newspapers) where less than 100% 
comprehension may be adequate for functional purposes. Thus 
it is understandable that experts should disagree as to a 
single best criterion of comprehension. Until good models of 
comprehension are available, these disagreements are unlikely 
to be resolved. 

b. Using probability statistics to define criterion 
of comprehension. A more productive strategy for defining a
criterion of comprehension for the reading effectiveness 
measure may lie in a statistical approach to the problem. As 
the discussion in this section will show, probability statistics
so constrain the possible interpretation of test results that
the statistical approach may well be the only viable one to use. 

Assume that the RRI reading comprehension test 
will be composed of N passages, each at a different level 
of readability. Further assume that testing time will be
limited to the amount of time normally devoted to a reading 
comprehension test (30-45 minutes) and that, as a practical 
matter, a total of about 30-45 questions can be asked. 

Few questions could be asked about each of many 
passages, or many questions could be asked about each of few 
passages. Since each of these extremes has distinct advantages 
and disadvantages (with regard to the range of readability that 
may be tested and the number of observations of behavior that 
may be made at each level of readability tested), assume that
a middle course is chosen, in which a moderate number of
questions (four or five) is asked about a moderate number of
passages (six to eight).

It is an inescapable fact of multiple choice 
testing that some items can be answered correctly by guessing. 
The probability of correctly guessing an item depends on the 
number of alternatives from which one may choose and the 
plausibility of each of those alternatives. In theory, if 
there are N equally probable response options, the probability 
of guessing an item correctly is 1/N. This is the assumption
normally made by psychometricians, e.g., in correcting a test
score for guessing. However, as a practical matter, if an
item is poorly written, one or more response options may not 
be regarded as plausible, and the probability of a correct 
guess may be greater than 1/N. 

For purposes of the argument below, it will be 
assumed that all test items have four or five response options, 
so that the probability of guessing the answer correctly if all 
responses have an equal likelihood of being chosen is p = .25 
or p = .20, respectively. It will also be assumed that, for 
extraneous reasons,^ the probability of a correct guess could 
be as high as p = .50. Actually, in most cases, the probability 
of a correct guess probably will lie somewhere between the 
boundaries p = .20 (for five response alternatives) or p = .25
(for four response alternatives) and p = .50. 

Table 3 shows the probability of correctly 
answering r out of N questions on the basis of chance alone. 
The p = .20 and p = .25 columns give the probabilities when 
standard psychometric assumptions are made concerning the equal
probability of response options. The p = .50 column gives the
probabilities for the hypothetical worst case described above.

Table 3

The Likelihood of Guessing Correctly Various Numbers of
Questions when the Probability (p) of
Guessing the Correct Response is:
p = .20, p = .25 and p = .50

Number of
questions
guessed        Four Questions per Passage    Five Questions per Passage
correctly      p=.20    p=.25    p=.50       p=.20    p=.25    p=.50

    5                                        .0003    .0010    .0312
    4          .0016    .0039    .0625       .0064    .0146    .1563
    3          .0256    .0469    .2500       .0512    .0879    .3125

Table 4 lists the cumulative probabilities of
guessing at least four out of five, three out of four, and
three out of five questions correctly when the probability of
a correct guess is either p = .20, p = .25, or p = .50. Table
4 shows that the freedom to select a criterion of comprehension
is severely constrained by the probabilities of correct guesses.
If the criterion of comprehension is defined as at least three
out of four questions correct, there is a risk that, in as
many as 31% of the cases, a conclusion that students under-
stood a passage will be drawn when, in fact, they were only
guessing. If at least three correct responses are required
when five questions are asked about a passage, a student who
is only guessing could meet the comprehension criterion as
often as 50% of the time.

Table 4

The Cumulative Probability of Guessing Correctly Various
Numbers of Questions when the Probability (p) of
Guessing the Correct Response is:
p = .20, p = .25 and p = .50

Number of
questions
guessed        Four Questions per Passage    Five Questions per Passage
correctly      p=.20    p=.25    p=.50       p=.20    p=.25    p=.50

All            .0016    .0039    .0625       .0003    .0010    .0312
At least 4/5                                 .0067    .0156    .1875
At least 3/4   .0272    .0508    .3125
At least 3/5                                 .0579    .1035    .5000

If we could safely assume that when the test
is constructed all responses will be equally probable and
therefore the probability of a correct response will be p = .25
if four response options are provided and p = .20 if five are
provided, Table 4 indicates that a criterion of at least
three questions correct out of four or a criterion of at least
four questions correct out of five^7 would be acceptable, in
that either criterion would reduce to six percent or less the
chances of crediting a guesser with comprehension of a passage.
However, since it might happen that the empirical probabilities
will turn out to be other than 1/N, the conservative course is
to select at least four questions correct out of five as the
criterion of comprehension. The hypothetical worst case would
then lead to an erroneous conclusion only about 19 out of 100
times rather than 31 out of 100 times, which could occur if
p = .50 and the criterion of comprehension was set as at least
three questions out of four correct. Table 4 makes it clear
that providing five rather than four response options (so that
p = .20 rather than p = .25) reduces only slightly the likeli-
hood that a student will meet the comprehension criterion by
chance, when compared with the much larger reduction that takes
place by setting the criterion as at least four questions
correct out of five rather than at least three questions cor-
rect out of four. Therefore it is probably not worthwhile
to spend the increment of testing time required for a student
to process five rather than four response options.

^ The principal extraneous factor that could affect the proba-
bilities of response options in the quasi-cloze items proposed
is the comparative likelihood of occurrence of the various
options in the sentence frames provided. As noted in Chapter
III, some words are more likely than others to occur in a given
sequence. Even if attempts are made to equate all response
options for semantic plausibility, the sequential probabilities
of English may affect the likelihood that various response
options will be selected. Note that the possibility of unequal
probabilities for the various response options is independent
of partial knowledge that the student may have concerning the
correct answer. For purposes of this discussion, it is assumed
that the student has no comprehension of the passage, but that
he is a competent user of English and hence knows the sequential
probabilities of the language.

^7 As a practical matter, it seems unreasonable to set perfect
performance (four out of four or five out of five) as the
criterion of comprehension, since students may, for a variety
of reasons, miss an item even if they comprehend the passage:
their attention may wander momentarily, they may misread the
response option, and so forth.

^8 It might be possible to pretest all options for equal
plausibility, and revise them as necessary until p = 1/N. However,
part of the strategy for development of the reading effective-
ness measure is to make item construction as mechanical as
possible. With this objective in mind it is probably more
cost effective to plan conservatively for the possible worst
case of p = .50 than it is to pretest all response options.
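
The chance probabilities in Tables 3 and 4 are consistent with the binomial distribution, under the assumption that guesses on the questions for a passage are independent and share one probability p of a correct guess. The sketch below is offered only as a check on the arithmetic; the function names are illustrative and do not come from the report.

```python
from math import comb


def p_exactly(r, n, p):
    """Chance of exactly r correct guesses on n questions (binomial)."""
    return comb(n, r) * p**r * (1 - p) ** (n - r)


def p_at_least(r, n, p):
    """Chance of r or more correct guesses on n questions."""
    return sum(p_exactly(k, n, p) for k in range(r, n + 1))


if __name__ == "__main__":
    # Criteria discussed in the text: at least 4/5, 3/4, and 3/5 correct.
    for n, r in [(5, 4), (4, 3), (5, 3)]:
        row = [round(p_at_least(r, n, p), 4) for p in (0.20, 0.25, 0.50)]
        print(f"at least {r} of {n}:", row)
```

Under these assumptions, p_at_least(3, 4, 0.50) gives the 31% worst-case figure and p_at_least(4, 5, 0.50) the 19% figure cited in the text.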

c. Determining the maximum difficulty level a 
student can comprehend. After each passage has been scored,
the data must be examined to determine the highest level of
textual difficulty that a student can comprehend. This means
analyzing the pattern of results over all passages to find
the point at which a student ceases to meet the criterion of
comprehension on test passages. 

Certain clear patterns of data will be easy to 
interpret, e.g., where a student has a passing score on all 
passages up to a certain readability level, but fails all 
passages that are more difficult. However, less clearcut 
patterns of data also may be expected, such as when a student 
has passing scores on all passages up to a certain readability 
level, then alternately passes and fails a number of passages 
prior to finally failing all subsequent passages. Procedures 
need to be developed for specifying the most difficult level
of material that a student can read in those cases where the
data do not show an unambiguous break point.^9
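
Because the report states that these procedures remain to be developed, the following is only a hypothetical sketch of how clear and unclear patterns might be told apart. It assumes the passages are ordered from easiest to hardest and that each has already been scored pass or fail against the comprehension criterion; the function name and the conservative and liberal readings are illustrative assumptions, not a proposed rule.

```python
def interpret_pattern(passed):
    """Summarize pass/fail results ordered from easiest to hardest passage.

    Returns the index of the hardest passage credited under two readings:
    conservative (last pass before the first failure) and liberal (last
    pass anywhere). An index of -1 means no passage is credited.
    """
    first_fail = passed.index(False) if False in passed else len(passed)
    conservative = first_fail - 1
    liberal = max((i for i, ok in enumerate(passed) if ok), default=-1)
    return {
        "unambiguous": conservative == liberal,
        "conservative_level": conservative,
        "liberal_level": liberal,
    }
```

For a clear pattern the two readings coincide; when a student alternately passes and fails, they bracket the region in which a decision rule would have to choose.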

The validity of inferences drawn from the data 
can be put to a simple empirical test. Students can be given 
materials of readability levels that (according to the reading 
effectiveness measure results) they should and should not be 
able to comprehend. Comprehension of these materials would 
have to be demonstrated behaviorally, e.g., by following
directions. If the effectiveness test data have been validly 
interpreted, the student should succeed with materials pre- 
dicted (by the effectiveness test results) to be within his 
comprehension, but should fail on materials that are predicted 
to be too hard for him. 



^9 Probability theory will again be useful in making these
decisions. Questions may be asked, for example, concerning
the probability of obtaining different sequences or patterns
of passing and failing passages on a chance basis in moving
from the least difficult to the most difficult passages.

H. A Strategy for Measuring Knowledge of Words

To ensure that the test actually measures knowledge of 
the selected test words, the responses for each item should 
be considerably (and uniformly) more familiar than the test 
word itself. It is necessary to avoid the type of item, found 
in some norm-referenced vocabulary tests, where one or more
of the response options is less familiar than the test word. 
It is not possible from items of the latter type to draw valid 
inferences concerning a student's knowledge of words in the 
frequency band from which the test item has been sampled, 
since the student may know the test word but not its less 
familiar synonym. 

To avoid this problem, the following strategy in construc- 
ting test items could be adopted. First, a narrow interval A 
on the word familiarity scale would be selected; the interval 
should be narrow enough so that all of the types belonging to 
it are approximately equally familiar. Then, a random sample 
of, say, 20 types belonging to A would be chosen as test words. 
For each test word, four responses also would be selected, 
of which only one matches the test word in meaning. All of 
these response options should be drawn from interval B, which
contains words that are more familiar than the test words.
This is shown in Fig. 4. It is unlikely that all 20 test 
words will have synonyms that are more familiar than the test 
words and that fall in a familiarity band as narrow as the 
interval A. That is why the interval B is shown as wider than 
the interval A. The exact width required in the interval B 
in order to find a synonym for all test words chosen from the 
interval A is not yet known; what is important is that the two 
intervals must not overlap. 
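
The two-interval strategy can be illustrated with a short sketch. Everything specific in it is an assumption made for illustration: the numeric familiarity values (higher meaning more familiar), the interval bounds, the synonym table, and the function names are hypothetical, and the report has not yet fixed the width of interval B.

```python
import random


def in_interval(value, interval):
    """True if value lies in the half-open interval [low, high)."""
    low, high = interval
    return low <= value < high


def intervals_overlap(a, b):
    """True if two half-open intervals share any point."""
    return a[0] < b[1] and b[0] < a[1]


def build_item(test_word, synonyms, familiarity, interval_b, rng, n_options=4):
    """Build one item whose options all come from interval B.

    Returns None when interval B holds no synonym for the test word or
    too few non-synonyms to serve as the remaining options.
    """
    b_words = sorted(w for w, f in familiarity.items() if in_interval(f, interval_b))
    meanings = synonyms.get(test_word, set())
    correct = [w for w in b_words if w in meanings]
    distractors = [w for w in b_words if w not in meanings]
    if not correct or len(distractors) < n_options - 1:
        return None
    answer = rng.choice(correct)
    options = [answer] + rng.sample(distractors, n_options - 1)
    rng.shuffle(options)
    return {"test_word": test_word, "options": options, "answer": answer}
```

A test word for which build_item returns None is one whose synonym falls outside interval B, which is the situation that would force interval B to be widened.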

Items to test knowledge of word meanings can be written 
in two principal formats; the test word can either be pre- 
sented in isolation or it can be presented in the context of 
a brief linguistic frame, such as a phrase. Each format has 
its own advantage. 

Presenting the test word in isolation has the advantage 
that it does not require the student to read any connected 
text in order to answer the question; hence performance on 
items in this format should clearly reflect students' knowledge
of individual word meanings. Presenting the test word
in a brief linguistic frame has the advantage of providing
enough context to resolve ambiguities in the case of test
words that have multiple meanings. Since multiple meanings
are fairly common among familiar words, and since familiar
words will be tested, it would be useful to have a format that
could reduce uncertainty for the student concerning which of
the meanings of a test word is intended in the test item.

[Figure: horizontal axis FAMILIARITY (FREQUENCY), with two marked
intervals labeled CHOOSE TEST WORDS FROM INTERVAL A and CHOOSE
RESPONSES FROM INTERVAL B]

FIG. 4 THE SELECTION OF TEST-WORDS AND RESPONSES FROM
DIFFERENT INTERVALS OF THE FAMILIARITY SCALE.

^10 This strategy (of making response options more familiar than
test words by selecting the options from a scale interval of
greater familiarity) can be expected to break down when test-
ing the commonest (highest frequency) words. By definition,
the most familiar words will not have more familiar synonyms.
Hence an alternative procedure, perhaps involving pictures,
will need to be developed when testing very high frequency
words, to assure that response options are not less familiar
than test words.

A final decision on item format for measuring word knowl-
edge has not yet been made.

I. A Plan for Computer-Assisted Test Construction

If a computer can be used to generate tests, then it 
will be practical to produce branched tests at costs that 
will be sufficiently low to be acceptable to school systems. 
Some computer assistance in the test construction process does 
appear to be feasible. In the comprehension section of the 
reading tests, a computer can help to generate alternative
response options for the words that have been deleted from 
text. A computer can clearly be used to supply lists of 
words of comparable frequency to the deleted word. It is 
even possible that the list could be limited to words of the 
same form class as the deleted word. Item writers will need 
to make the final selection of response options, since 
options within an item should be matched for grammatic and 
semantic plausibility in the blank space (i.e., plausibility 
if a sentence stood by itself), and fulfillment of that
criterion will require human judgment. 
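
The list-generation step described above can be sketched as follows. This is a minimal illustration, assuming a hypothetical lexicon that maps each word to a log-frequency value and a form class; the lexicon format, the tolerance parameter, and the function name are assumptions and are not specified in the report.

```python
def comparable_words(target, lexicon, tolerance=0.25, same_form_class=True):
    """List words whose frequency is close to that of the target word.

    lexicon maps word -> (log_frequency, form_class). When same_form_class
    is True, only words of the target's form class are returned.
    """
    target_freq, target_class = lexicon[target]
    matches = []
    for word, (freq, form_class) in lexicon.items():
        if word == target:
            continue
        if abs(freq - target_freq) > tolerance:
            continue
        if same_form_class and form_class != target_class:
            continue
        matches.append(word)
    return sorted(matches)
```

The output is only a candidate list; as the text notes, the final choice of options still rests with the item writer.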

In the vocabulary section of the test, a computer can be 
used both to supply lists of words to be tested and to provide 
possible response options. Once a decision is made concerning 
the familiarity bands in which knowledge of words is to be 
tested, a computer can print out lists of words falling in 
the specified frequency bands. From these lists, test items 
can be selected. Once the synonym (correct option) for each
test item has been chosen (it is not known at this time whether
this is best done by hand or computer), a computer can provide
lists of words of comparable frequency. It may also be possible 
to restrict these lists to the same form class as the correct 
option. From these lists, selection of the required number 
of options will be made (or at least checked) by hand, since 
it is unlikely that a computer can be programmed at this time 
to do an adequate job with problems of multiple word meanings 
or connotations that could make an item ambiguous.

J. A Plan for Administering and Interpreting the Tests

Tests of reading comprehension and word knowledge built 
following the procedures described in this chapter can be 
administered periodically to assess students' reading achieve- 
ment. At this stage the tests can be used only for limited 
purposes since, as yet, standards of reading competence have 
not been set. Although there is no basis, until standards 
are set, for judging whether students' achievement is "satisfac- 
tory," the tests can provide detailed information concerning 
what students know. By comparing students' performance over 
time, growth toward adult levels of reading comprehension and 
word knowledge can be measured. 

The tests would be graduated in difficulty. Passages to 
test reading comprehension would gradually increase in dif- 
ficulty from the lowest level of readability to the level of 
the most difficult materials found in the corpus. Vocabulary 
tests for the youngest students would start by testing knowledge
of the highest frequency words, and tests for older
students would gradually add words in the moderate and low 
frequency bands. 

Scores on the comprehension test would indicate the level 
of readability of materials which a student can read with 
understanding. The scores could be interpreted for parents 
or teachers by illustrating the kinds of materials that a
student ought to be able to read, given his test performance.
As a student gets older, his test performance could be related 
directly to adult tasks. Scores on the word knowledge test 
would be interpreted by estimating the size of a student's 
vocabulary in various frequency bands, and comparing his word 
knowledge profile to word type distributions in adult materials. 
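
The interpretation of word knowledge scores can be illustrated with simple arithmetic. The sketch below assumes that the words tested in each frequency band are a random sample of the types in that band, so that the proportion answered correctly estimates the proportion of the band a student knows. The band names, band sizes, and function name are hypothetical, and no correction for guessing is applied.

```python
def estimate_known_types(band_sizes, n_correct, n_tested):
    """Estimate how many word types a student knows in each band.

    band_sizes: band -> number of word types in the band
    n_correct:  band -> number of test words answered correctly
    n_tested:   band -> number of test words sampled from the band
    """
    estimates = {}
    for band, size in band_sizes.items():
        proportion = n_correct[band] / n_tested[band]
        estimates[band] = round(size * proportion)
    return estimates
```

For example, 12 correct out of 20 sampled words in a band of 5,000 types would suggest knowledge of roughly 3,000 types in that band.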



While the purpose of these tests is to provide data that are 
directly interpretable with respect to students' reading
capabilities, it is a simple matter to provide a norm-referenced 
interpretation for these test scores if such an interpretation 
is desired, by accumulating sufficient test data and reporting 
the distributions of scores per grade.

Chapter V

Application of the Design Concepts for Quantifying 
English Text in Setting and Monitoring Standards 

In the previous chapter, the application of the reada- 
bility formula and word frequency distributions to the con- 
struction of tests to assess students' current reading 
achievement and their progress toward attaining adult levels 
of reading skill was discussed. This chapter will show how 
the quantitative scaling of English text in terms of reada- 
bility and word familiarity would facilitate the formal 
evaluation of reading effectiveness by providing relevant 
input data for setting standards of adult reading competence, 
and how this scaling can serve as a basis for measuring 
students' progress toward the attainment of those standards.

This chapter will also show how the readability and 
word familiarity measures could be used in analyses of 
instructional materials to determine whether or not the 
readability and vocabulary of those materials may account 
for students' failure to meet standards of adult reading 
competence. These analyses could result in recommendations 
for more rational readability and vocabulary demands on 
students which, if implemented, could lead to an increase in 
system effectiveness in reading.

A. Input Data for Setting Standards

Setting standards of adult reading competence requires 
value judgments to be made concerning the capabilities that 
adults need to acquire. Such value judgments are the respon- 
sibility of government. While scientists have no direct role 
to play in making "should" or "ought" statements concerning 
adult reading capabilities, they do have an important role 
to play in providing those empowered to set standards with 
the technical data upon which informed decisions should be 
based. Analysis of English text in terms of readability and
word familiarity makes it possible to provide government with 
information in a form that enables standards to be set in 
precise, quantitative terms. 

1. The quantitative display of adult English text.
Suppose that the readability and word frequency characteris- 
tics of adult materials have been scaled, using periodicals 
to define the corpus of adult reading materials. The basic 
display of input data to be provided, for example, to the 
legislative branch of government should familiarize its mem- 
bers with the range and general meaning of scale values found 
in adult reading materials. This could be done by displaying 
the scale values of all the analyzed materials, clustered by 
difficulty and word frequency characteristics. Such a dis- 
play would reveal the functional meaning of differences in 
scale values. For example, it might be seen that a cluster
at the low (easy) end of the scale includes such materials as
the Reader's Digest, Daily News, pulp fiction magazines, etc.
The upper end of the scale might include physics journals,
literary criticism magazines, etc. This display should help 
anchor scale meanings for legislators in terms of materials 
which are familiar to them. 

2. The difficulty and familiarity of the reading 
requirements imposed by New York State. Once the entire
difficulty range has been arrayed, legislators may ask for
the scale values of any set(s) of materials they consider
relevant to setting standards. It can be anticipated, for 
example, that one set of materials whose scale values would 
constitute useful input data would be those materials that 
the State of New York expects its citizens to read. This
would include such diverse materials published by state 
governmental agencies as tax forms, driver's license appli- 
cations, official notices, advisory information on a variety 
of subjects, etc. Since publication of these materials 
implies that the government currently expects citizens to be
able to read them, the readability and word frequency characteristics
of these materials are highly relevant to the
setting of realistic standards of adult reading competence. 

The scaling of these materials would allow legis- 
lators and others in government to see how the reading tasks 
placed on citizens by the state compare with the full range
of adult materials. If any reading tasks considered essential
should fall at the upper (difficult) end of the scale, government
would have the choice either of setting standards at
that high level, or of directing that an attempt be made to
simplify the materials to some lower level that was specified
as the standard.

However, the extent to which materials can be sim- 
plified is constrained. The heart of the problem in simpli- 
fying text is that vocabulary difficulty is the single most 
important factor affecting readability, and, in at least
some cases (such as the insurance policies cited above, or 
tax forms, legal documents, etc.), the vocabulary required 
to convey essential concepts cannot be replaced by more 
common words. Thus, although some sentences may be made less 
complex and some simplification of vocabulary may be possible, 
it is unrealistic to expect that the difficulty of all 
essential or important reading materials could be reduced to 
elementary reading levels.^ 



This fact is illustrated in a recent attempt by Pennsylvania's
insurance commissioner to increase the readability of insur-
ance policies. Using Flesch's prescriptions for producing more
readable writing by manipulating such features as word length
and sentence length, Blue Shield succeeded in changing the
readability scores of a Medicare policy only slightly, from
26.8 to 35.0 (or about 8% of the scale range), on a scale from
0 (very difficult) to 100 (very easy) (New York Times, July 8, 1973).

3. Scaling other pertinent materials. Since the mate-
rials issued by New York State may not cover all of the
important reading tasks that adults need to carry out,
government might request additional input data. Once
word-type probabilities have been determined, and more is known
about the range and relative frequency of different levels 
of readability in the corpus constituted from periodicals, 
any materials can be analyzed and located on the readability 
and word familiarity scales. Thus, if they wished, persons 
in government could receive data concerning the scale values 
of other reading tasks that have been considered important 
by presumed experts, such as the tasks in the Harris Survey, 
in the Educational Testing Service collection, or in the 
Adult Performance Level Study. At their request, they might 
also receive information on the comparative readability and 
word frequency characteristics of entry level materials in 
various fields, e.g., what levels of reading skill are 
required to read introductory texts in automotive mechanics, 
business administration, chemistry, etc.

4. Forecasting future reading requirements. In
gathering input data for setting standards, government may 
also wish to receive information concerning the reading 
requirements that citizens will need to meet fifteen to 
twenty years in the future. A minimum forecast of twelve 
years would be required, since it takes that long for a
student to complete his schooling. The program of reading
instruction that a student receives from the time he enters 
school should be geared to preparing him for the standards 
he will need to meet, at the time of high school graduation, 
in order to enter college, technical school, or an apprentice- 
ship program, to get a job, or to function effectively in 
the adult world in general. 

The need for forecasting is based on the assumption 
that adult reading requirements will be different in the 
future from what they are today. Linguists (e.g., Fries,
1962) have provided considerable information that language 
is constantly changing. However, such changes in language 
are slow to occur. A much more potent force producing short- 
term changes in the difficulty of adult reading materials 
results from the fact that our society is rapidly changing 
its requirements for citizenship and work. 

An informal analysis of reading requirements sug- 
gests that the materials that people need to read in order 
to function effectively as adults have increased in quantity
and in difficulty over the last few decades. In order to 
use the sophisticated machinery and the array of new products 
developed in recent years, adults must read more than they 
had to in earlier times. Consumer trends, such as increased 
use of credit agreements, checks, etc., also mean that the 
public must do more reading. Each year there are fewer jobs
that can be filled by persons with limited reading skills.
These changes and others have resulted in a sharp increase in 
the reading demands on adults over the last generation. Since 
there is no reason to expect society's rapid rate of change to 
decline in the near future, the reading demands that will face 
the next generation of adults will probably differ from those 
facing current high school graduates. Therefore, to do an 
adequate job of educating students to be competent adult read- 
ers, we need to be able to forecast the reading skills that a 
student entering first grade now will need by the time he is 
a graduating twelfth grader. 

The design concepts for quantifying language phe- 
nomena appear to make forecasting feasible. The general 
strategy for forecasting would be straightforward: in any 
field, materials would be sampled to establish the mean and 
variance of their readability at several historical points 
in time. A curve would be fit to the data, and from this 
curve extrapolations could be made to future points in time. 
This procedure could be followed for materials that citizens 
are required (by law) to read, and for materials adults need 
to read for their own well-being, as well as for materials in 
any occupational area. If government wanted to make the fore- 
cast data more sophisticated by weighting such factors as 
the future prospects for an occupation, or the anticipated
importance of various reading tasks, methods for quantifying
such trends may have to be developed and weighting procedures 
would certainly need to be formulated. 
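
The forecasting strategy described above (sample materials at several historical points, fit a curve, extrapolate) can be sketched as follows. This is only an illustration under stated assumptions: the readability scores are invented, the scale is assumed to rise with difficulty, and a straight line stands in for whatever curve the data would actually call for.

```python
from statistics import mean, variance

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

# Hypothetical readability scores of materials sampled in one field
# at four historical points in time (all values are invented).
samples = {
    1943: [40.0, 44.0, 42.0],
    1953: [45.0, 47.0, 49.0],
    1963: [50.0, 54.0, 52.0],
    1973: [55.0, 59.0, 57.0],
}

years = sorted(samples)
means = [mean(samples[y]) for y in years]          # mean readability per year
variances = [variance(samples[y]) for y in years]  # sample variance per year

intercept, slope = fit_line(years, means)

# Extrapolate twelve years ahead, the minimum forecast horizon in the text.
forecast_1985 = intercept + slope * 1985
print(f"slope per year: {slope:.2f}, forecast for 1985: {forecast_1985:.1f}")
```

With real data, the per-year variances would indicate how much confidence to place in each point, and a nonlinear curve might fit better than a line.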

5. Using input data to set standards. After examining
all input information that it considers relevant, government 
can set terminal standards of reading comprehension and 
vocabulary for students in New York State, based on the 
readability and word frequency characteristics of reading 
tasks it considers essential or important. These would 
probably be system-wide rather than individual standards. 
That is, the standards would probably define a floor or 
minimum level of reading competence for any student being 
processed by the educational system; they would probably 
not purport to specify the standards that any particular 
individual should strive to achieve. 

When standards are set on the basis of information 
provided by analyses of the readability and word familiarity 
of various reading materials, they lead to precise definitions 
of what the educational system must produce. For example, 
if New York State sets as a standard that all graduating 
students be able to read income tax forms, then schools must
turn out students who can read materials of difficulty level 
X, and who know Y percent of common words, Z percent of rare 
words, etc.

6. Standards may be tentative until costs are known.
Since the present state-of-the-art does not make it possible 
to calculate the costs associated with attaining various 
standards, government might wish to regard the standards it 
sets as tentative, pending information on how attainable they 
are, and at what cost. It may, however, be a long while 
before this information can be provided. In order to relate 
costs to the attainment of standards, the time (and there- 
fore the costs) required for teachers to achieve various 
program objectives must be known. Furthermore, it is neces- 
sary to understand how achieving program objectives affects 
progress toward the attainment of standards (i.e., operation-
ally, how achievement of program objectives affects perfor-
mance on effectiveness measures). Finally, the consequences
of achieving particular objectives must be known not only 
for short-term progress toward standards, but also for the 
long-term or ultimate attainment of standards. 

Most of the complex information needed to accom- 
plish this linking of time (costs), objectives, and standards 
is not presently available. For example, one prerequisite 
is programs that are fully described with respect to objec- 
tives and the procedures used for meeting those objectives. 
Since educational systems currently do not formulate precise 
program documentation, the linking of costs and standards 
does not appear likely in the near term.

B. Measuring and Displaying Effectiveness in Reading

With standards of adult reading competence defined, it 
becomes possible formally to evaluate whether educational 
systems (the state as a whole, districts, etc.) meet those 
standards. As noted in the preceding chapter, reading achieve- 
ment can be measured even though standards have not been set, 
but a formal judgment cannot be made concerning the adequacy 
of achievement until there are standards against which to 
evaluate performance. Once minimum terminal objectives in 
reading have been specified, a criterion exists against which
the success of educational systems can be evaluated. 

1. Measuring attainment of standards. System effec-
tiveness would be measured by administering tests of word
knowledge and reading comprehension to high school seniors to 
determine whether or not their knowledge of words in various 
frequency bands and the readability of materials they can 
comprehend meet the levels specified by the standards. 
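
As a rough sketch of the attainment check just described, the comparison of test results against a standard might look like the following. Every name and number here is a hypothetical placeholder: the three frequency bands, the required proportions, and the readability scale (assumed to rise with difficulty) are not taken from the study.

```python
# Hypothetical standard: minimum proportion of words known in each
# frequency band, plus the hardest readability level a graduating
# senior is expected to comprehend.
STANDARD_VOCABULARY = {"common": 0.95, "intermediate": 0.80, "rare": 0.50}
STANDARD_READABILITY = 60.0

def meets_standard(known_by_band, readability_comprehended):
    """True if a student meets every vocabulary floor and the readability level."""
    vocabulary_ok = all(
        known_by_band.get(band, 0.0) >= floor
        for band, floor in STANDARD_VOCABULARY.items()
    )
    return vocabulary_ok and readability_comprehended >= STANDARD_READABILITY

def system_attainment(students):
    """Fraction of tested seniors who meet the whole standard."""
    passed = sum(meets_standard(s["known"], s["readability"]) for s in students)
    return passed / len(students)

# Invented test results for four seniors.
seniors = [
    {"known": {"common": 0.99, "intermediate": 0.90, "rare": 0.60}, "readability": 70.0},
    {"known": {"common": 0.97, "intermediate": 0.85, "rare": 0.55}, "readability": 61.0},
    {"known": {"common": 0.96, "intermediate": 0.70, "rare": 0.40}, "readability": 65.0},
    {"known": {"common": 0.98, "intermediate": 0.88, "rare": 0.58}, "readability": 50.0},
]
print(system_attainment(seniors))
```

Reporting the vocabulary and readability components separately, rather than a single pass rate, would show which part of the standard a system is failing to meet.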

2. Measuring progress toward standards. On the hypothe-
sis that system effectiveness in reading were actually measured
and the results reported, several questions can be anticipated 
in connection with interpreting and understanding the findings, 
particularly if system effectiveness were generally below 
minimum standards. For example, how do students move toward 
the standards over the grades? Does progress increase mono-
tonically over the grades, or does it essentially level off
after, say, grade six?

To answer such questions, it should be possible to
measure students' progress toward adult standards by adminis- 
tering one or more tests in which readability and vocabulary 
vary from the very simplest levels up to the level of adult 
standards. The results of periodic testing could be related 
to the standards to determine whether students were progres- 
sing toward adult standards and to describe how far they had 
progressed. 

While student progress toward adult standards could 
be described, it would not be possible (at least in the near
term) to evaluate the progress that was made, i.e., to say
whether or not it was adequate. Since grade level standards 
of reading competence have not been established, there is 
presently no basis for judging that a particular level of 
reading skill in grade X is satisfactory or unsatisfactory 
with respect to reaching adult competence by the twelfth
grade.

The tests of graduated difficulty described here are simi-
lar to those described in the previous chapter, where the
discussion concerned assessment procedures. The principal
difference lies in the criterion used to interpret test data:
in the present instance growth is measured against specific
standards, whereas in the previous chapter growth was related
to adult reading in general.

3. Measuring attainment of de facto grade standards.
Although student progress cannot be formally evaluated rela-
tive to the ultimate attainment of adult standards, it is a
straightforward matter to determine whether or not students
are meeting the de facto standards that are operative in
grades 1-12. These de facto standards can be defined by
employing the readability formula and knowledge of word
familiarity (based on analysis of the corpus of adult peri-
odicals) to characterize the instructional materials used in
each grade.

In order to do this, the instructional materials
used in each of the grades would first need to be sampled,
thereby constituting several grade level corpora.^3,^4 To
determine grade level expectations (de facto standards) for
readability, the formula would be applied to the corpora, and
the mean and standard deviation of the materials used in each
grade would be calculated. To determine grade level expecta-
tions for word familiarity, the vocabulary used in each grade
would be characterized in relation to the familiarity of words
previously found in the analysis of adult materials. Once the
de facto readability and word familiarity standards of each
grade were defined in this manner, it would be possible to
analyze students' performance on the reading effectiveness
measure to determine whether these grade level standards were
being met.^

^3 The grade level corpora compiled by Carroll, Davies, and
Richman (1971) would not be suitable because: (a) they lack
information for several grades; (b) they may not accurately
represent instructional materials used in New York State;
and (c) the reported word frequencies do not take size of
readership into account.

^4 In defining grade level corpora of instructional materials,
it is probably unnecessary to analyze the readability and
word type distributions of every book used in the schools.
Books that are used for supplementary study or resource pur-
poses in the classroom may be excluded, since many of them
are used by few rather than all students, and thus do not
reflect stable reading demands that schools are placing on
students.

Library books should also be excluded, since it may be
impossible to estimate accurately how many students use them
or the grade(s) in which they are used.

On the other hand, limiting the corpus exclusively to
reading textbooks would constitute too narrow a definition
of the reading demands that are placed on students. To
function effectively, a student must be able to read, in addi-
tion to his formal reader, at least the contents of his math,
social studies, science, and health books. Since use of these
books is required of practically all students, their reada-
bility and vocabulary should be considered a legitimate part
of the reading demands placed on students in the grades and
programs in which they are used.

The information needed to construct the instructional corpus
may be collected through a survey, asking which books are used
as required texts in different schools in reading, math, sci-
ence, health, and social studies, and in which grades those
books are used. Since the number of different textbooks in
these curriculum areas is limited relative to the number of
schools, a survey of a stratified random sample of schools,
rather than a survey of every school in New York State, should
be sufficient to net all major texts.

After the corpus of instructional materials has been identi-
fied, readability and word type distributions would be calcu-
lated by grade, program, and content area. Weighting proce-
dures would be used to take account of differences in the num-
ber of students using the materials.



It is apparent that the proposed measure of effectiveness
in reading could also be used to determine whether standards
other than those which may exist in a de facto sense at
various grade levels are attained. If reading programs
specified that they expected students to know vocabulary of
a given familiarity or to be able to read at a given diffi-
culty level, then whether or not a program was meeting its
objectives in year Y, Y+1, Y+2, etc. could be determined by
testing to see whether students had acquired the word know-
ledge and reading comprehension levels set as program objec-
tives. Furthermore, with the appropriate statistical designs,
it would also be possible to determine comparative program
effectiveness, e.g., among programs A, B, C, and D, at the
time of completion, and also to determine the comparative
long-term payoff of the programs N years after completion.
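
The de facto grade standard discussed in this section (the mean and standard deviation of the readability of a grade's required texts, weighted by the number of students using each text) can be sketched as below. The readability scores and enrollment figures are invented, and weighting by enrollment is only one plausible reading of the weighting procedures the text calls for.

```python
from math import sqrt

def weighted_mean_sd(scores, weights):
    """Weighted mean and (population) standard deviation of readability scores."""
    total = sum(weights)
    m = sum(w * s for s, w in zip(scores, weights)) / total
    var = sum(w * (s - m) ** 2 for s, w in zip(scores, weights)) / total
    return m, sqrt(var)

# Hypothetical required texts for one grade:
# (readability score, number of students using the text).
grade3_texts = [(30.0, 1000), (40.0, 3000)]

scores = [score for score, _ in grade3_texts]
students = [count for _, count in grade3_texts]

grade3_mean, grade3_sd = weighted_mean_sd(scores, students)
print(f"grade 3 de facto standard: mean {grade3_mean:.1f}, SD {grade3_sd:.2f}")
```

The same function could be applied separately by program and content area to obtain the breakdowns mentioned in the footnote on corpus construction.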
C. Analyses of Effectiveness Data

In the event that administration of effectiveness mea-
sures revealed that students were not meeting standards, it
would be important for senior managers at the State Education
Department to identify the factors contributing to the less-
than-desired levels of effectiveness. Logically, the current
problem would have to be defined before alternatives for
improving system performance could be formulated. The reada-
bility and word familiarity design concepts make possible a
number of analyses that should contribute to an understanding
of why standards are not being met, and should indicate pos-
sible courses of action to correct the problem. While the
types of analyses outlined below would not be sufficient to
identify all the factors contributing to students' failure to
attain standards, they should make a distinct contribution to
a good definition of some of the problems and alternative
solutions.

1. Multivariate analyses of effectiveness in reading.
Information concerning whether de facto grade standards were
being met could be analyzed in conjunction with information
concerning the pattern of progress toward adult standards
over the grades in order to understand why any target group(s)
or school system(s) were attaining a given level of effec-
tiveness by the twelfth grade. Such a study would enable
SED to isolate problems and consider alternative solutions.

For example, let us assume that a problem defini-
tional study showed that all (black, white, and Spanish-
speaking) middle class students attained the minimum adult
standards by grade ten; all lower class (white, black, and 
Spanish-speaking) students did not attain minimum standards 
by grade twelve. Let us further assume that the de facto 
grade level standards increased monotonically from grades 
one through five and then leveled off so that the rate of 
growth in expected reading skill was less for grades six
through twelve than for grades one through five; furthermore, 
by tenth grade, de facto standards had reached the level of 
adult standards. Continuing with our hypothetical example, 
let us assume that all middle class students met or exceeded 
grade level standards in all grades, but that, starting in 
grade three, all lower class students did not attain grade 
level standards. It then follows that it would be reasonable 
to inquire into the feasibility of reducing, for lower class 
students, the rate by which the standards increased in grades 
one through five, increasing the rate in grades six through
twelve, and setting the present de facto grade ten standard 
as the expected value in grade twelve. Such changes might 
reasonably be considered in order to better insure that lower 
class students attained adult standards by graduation. 

2. Analyses of instructional materials. The reada-
bility and word familiarity measures could be used in analyses
of instructional materials designed to locate possible pro-
blems in the readability and vocabulary learning demands
placed on students. Once such problems were identified, 
categorical responses designed to alleviate them, and hence 
to increase effectiveness, could be suggested to senior man- 
agers of the State Education Department. 

The analyses that are illustrated below do not, by 
any means, exhaust the analyses of instructional materials 
that might be carried out in looking for sources of failure 
to meet adult standards. For example, the question of the 
consequences of alternative methods of teaching reading on 
attaining standards is not touched upon. Rather, the illus- 
trations are limited to analyses that are related to the 
readability and word familiarity measures. 

a. Disorder (chaos) in instructional materials.
To RRI's knowledge, no learning theorist or educator has ever 
found that chaos contributed to learning. Yet there is rea- 
son to believe that the vocabulary of instructional materials 
creates a chaotic set of inputs for students. 

Evidence presented by Stauffer (1966) indicates 
that reading programs differ concerning which words are intro-
duced in which grades. If there are major differences between
the vocabulary in various sets of instructional materials, 
there should be difficulty in maintaining a logical sequence 
of instruction. Whenever a school changed books or whenever
a student changed schools, students would be apt to encounter
a great number of words not previously learned or, alterna- 
tively, be asked to learn words already mastered. The prob- 
lem of encountering a great number of new words could be 
expected to be most serious for those students who are most 
dependent on the schools for what they learn, presumably
educationally disadvantaged students. 

On the hypothesis that construction of grade- 
level corpora were undertaken, the data would be available 
with which to determine the degree of similarity across 
reading program materials from different publishers with 
respect to the vocabulary and readability introduced in each 
of the grades. If a display of the overlap of readability 
and of vocabulary across programs confirms that, in fact, 
there is appreciable chaos (i.e., little overlap), corrective 
action would be indicated. A logical step to correct the 
problem would be for the State Education Department to direct 
publishers that there be (at least) a minimum specified 
amount of overlap in vocabulary and readability across reading 
programs.

b. The effect of the amount of learning expected
per time unit (load) on cognitive development. There is
evidence to suggest that the vocabulary load that is being 
placed on students is not being carefully planned or con-
trolled. First, there is evidence suggesting that publishers
do not coordinate vocabulary across curriculum areas.
Stauffer's (1966) comparison of reading, arithmetic, health, 
and science books showed little overlap in each grade between 
the vocabulary words introduced in the books used in the 
different subject areas of the curriculum. For example, he 
found that while 2153 new words were introduced in seven 
reading series in the third grade, and 2150 new words were 
introduced in three arithmetic series in that grade, only 421 
words were common to both lists. Moreover, many words 
appeared in textbooks in different subject areas which did 
not appear in any of the seven reading series at any grade 
level. Stauffer has estimated that even if a student some- 
how had the opportunity to learn the vocabulary of all seven 
reading series, he would learn only half the words he would 
encounter in his arithmetic books. 
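
The third-grade counts reported above lend themselves to a simple overlap calculation. The sketch below uses the three figures quoted from Stauffer (2153, 2150, and 421); the function name and the choice of overlap measures (share of each list, and the Jaccard index) are illustrative assumptions, not measures proposed in the study.

```python
def overlap_stats(n_a, n_b, n_common):
    """Overlap between two word lists, given their sizes and the shared count."""
    union = n_a + n_b - n_common
    return {
        "share_of_a": n_common / n_a,   # fraction of list A also found in B
        "share_of_b": n_common / n_b,   # fraction of list B also found in A
        "jaccard": n_common / union,    # shared words / all distinct words
    }

def overlap_from_word_lists(words_a, words_b):
    """Same statistics computed directly from two (non-empty) word lists."""
    a, b = set(words_a), set(words_b)
    return overlap_stats(len(a), len(b), len(a & b))

# Third-grade counts quoted in the text: 2153 new words in the reading
# series, 2150 in the arithmetic series, 421 common to both lists.
grade3 = overlap_stats(2153, 2150, 421)
for name, value in grade3.items():
    print(f"{name}: {value:.3f}")
```

On these figures, fewer than one in five of the new words in either list appears in the other, which is the sense in which the text speaks of little overlap.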

If these data are correct, the effective 
reading load on pupils is considerably heavier than the 
requirements made by formal reading programs (and, for that
matter, heavier than the requirements made by typical college-
level foreign language courses), since students must also
learn much new vocabulary in the subject areas. Furthermore, 
since new vocabulary is seldom identified as such in text- 
books, other than basal readers, there is a good chance that 
teachers will not formally teach these new words, thereby 
leaving the learning of much critical vocabulary entirely up 
to the student.

There is also evidence suggesting that the
reading load presently placed on students may not be evenly
distributed over grades. Stauffer's (1966) data show a
geometric increase in the number of new reading vocabulary
words introduced per grade, in grades one through three.

RRI believes that loading rate should have
important effects on reading achievement since a related 
variable, massed versus distributed practice, is a classic 
learning variable of considerable significance. However, 
the actual relations between loading rate and learning to 
read are unknown, and would need to be clarified in a program 
of research before the load currently found in instructional 
materials could be evaluated. 

Once the relationships between loading rate 
and learning to read were known, that knowledge could be 
used to evaluate current loading practices in terms of whether 
or not they promote efficient learning. For example, it could 
be that the vocabulary load of reading programs is about 
right for efficient learning but that, in some grades and 
for some types of learners, the added burden of vocabulary 
in the subject areas creates an overload. 

A possible outcome of such analyses could be 
directives from the State Education Department to publishers
governing the number of new words to be introduced per grade,
and the coordination of vocabulary across curriculum areas.

c. The effects of difficulty on learning. The
readability formula and word frequency distributions could be 
used in analyses designed to determine whether the difficulty 
level of instructional materials being used by students might 
be responsible for their failure to attain adult standards of 
reading competence. 

It is well established that difficulty is a 
critical variable in learning. When materials are too easy 
for students, boredom results. When materials are too dif-
ficult, students may become frustrated, "tune out," and
develop negative attitudes toward learning. In some cases, 
students may even develop negative attitudes toward themselves 
(e.g., "I am not capable of learning") or set unrealistically 
low achievement aspirations in order to reduce the impact of 
the failure experiences they are likely to have when asked to 
learn materials that are too difficult for them. In short, 
when materials are either too easy or too hard, students do 
not learn efficiently and the risk that educational processes 
will produce unwanted outcomes increases. It is for these 
reasons that controlling the difficulty of instructional 
materials has been a traditional concern of educators. 

As long ago as 1917, Thorndike (1917) suggested
that teachers not use materials for instructional purposes
unless a child could correctly answer 75% of the comprehen- 
sion questions asked of him about the materials after he had 
studied them. Many well known writers in the field of
education have since echoed Thorndike's suggestion in writing
for teachers. The long history of interest in measuring reada-
bility is another reflection of educators' desire to present
students with materials of a suitable level of difficulty.

A recent study by Bormuth (1968b) clearly indi- 
cates that the amount learned from studying instructional 
materials is a function of the initial difficulty of the 
materials for the student. After matching pairs of students 
for reading skill, Bormuth had one member of each pair take a 
cloze test over a passage to measure his comprehension of the 
passage, while the other member of the pair answered questions 
about the passage prior to and after reading it. He corre- 
lated cloze scores with information gain, which was defined 
as the increase in the number of questions answered correctly
after reading the passage. Bormuth found (Fig. 5) that stu-
dents gained very little information from studying difficult
materials (cloze scores equal to or below 25%), and that
studying easy materials (cloze scores equal to or higher
than 37%) resulted in only very slight increases in infor-
mation gain.

Bormuth's data indicate that there is a level
of difficulty, neither too easy nor too hard, over which each
increment in difficulty will result in a corresponding incre-
ment in information gain. As Fig. 5 shows, information gain
for Bormuth's subjects was a monotonic function of difficulty
only over the range bounded by cloze scores of 25 to 37 per-
cent.^

FIG. 5. EIGHT DEGREE POLYNOMIAL CURVE FITTED TO THE REGRESSION
OF EACH PAIR'S INFORMATION GAIN SCORE ON ITS CLOZE
READABILITY SCORE (SLIGHTLY MODIFIED FROM BORMUTH, 1968b).
HORIZONTAL AXIS: CLOZE READABILITY PERCENTAGE SCORE.



As Bormuth notes, however, we cannot assume 
that the learning curve found for his college subjects would 
hold for all learners over all materials. Rather, the 
results in Fig. 5 should probably be taken as illustrative 
of a more complex set of relationships; it may be assumed 
that they represent only one of a family of curves that would 
result if the relationships between learning difficulty and 
information gain were studied for students of different ages,
abilities, and backgrounds, as well as for different types of
materials and amounts of study time. Further research would
be required to define this family of learning curves before the
appropriateness of the difficulty level of particular in-
structional materials for particular students could be evalu-
ated.^
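
To make the cloze bounds concrete, the sketch below sorts a material into one of three zones using the 25 and 37 percent figures reported for Bormuth's subjects. The function and zone labels are illustrative assumptions; as the text cautions, these bounds describe one group of college students and cannot be taken as general cutoffs.

```python
# Cloze bounds reported in the text for Bormuth's (1968b) subjects.
# They are used here purely for illustration, not as general cutoffs.
LOWER_CLOZE = 25.0
UPPER_CLOZE = 37.0

def difficulty_zone(cloze_percent):
    """Classify a material by the cloze score a student obtains on it."""
    if not 0.0 <= cloze_percent <= 100.0:
        raise ValueError("cloze score must be a percentage between 0 and 100")
    if cloze_percent <= LOWER_CLOZE:
        return "too difficult: little information gained"
    if cloze_percent >= UPPER_CLOZE:
        return "easy: information gain has leveled off"
    return "intermediate: gain rises with cloze score"

for score in (20.0, 31.0, 45.0):
    print(score, "->", difficulty_zone(score))
```

If the family of learning curves called for in the text were established, the two constants would become a lookup by student age, ability, material type, and study time.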



It is interesting to note that Thorndike's (1917) sugges- 
tion that teachers use moderately easy materials (at least 
75% comprehensible) was a good estimate (see Bormuth [1968b] 
for how the 75th and 90th comprehension percentiles were 
translated into cloze scores). It appears from Fig. 5, how- 
ever, that, in terms of cloze scores, a more difficult aver- 
age level of difficulty should govern the material presented 
to students.

^ To facilitate later use of the data, difficulty in such 
research should probably be defined in terms of readability 
discrepancy (i.e., the difference between the readability of 
the new materials to be learned and the readability of mate- 
rials the student can currently comprehend) rather than in 
terms of cloze scores.

Once these relationships between difficulty
and learning were established, it would be possible to deter- 
mine whether unsuitably difficult instructional materials 
(either too easy or too hard) were associated with students' 
failure to meet standards. On the hypothesis that effective- 
ness measures were regularly administered to students, the 
readability levels that students could comprehend would be 
known. The readability of instructional materials used by 
students could be scaled with the readability formula. It 
would then be a straightforward task to determine the dif- 
ficulty of the materials for students, and to evaluate that 
difficulty in terms of the learning curves that had been 
established. 

It might be that some students were being 
asked to use instructional materials whose readability was, 
according to the learning curves, much too difficult for 
efficient learning. In surveys conducted among a nationally 
representative sample of eleventh grade students in 1960 and 
1970, 38% and 33% of those interviewed, respectively, reported 
that, at least half the time, "I read material over and over
again without really understanding what I have read" (Flanagan
and Jung, 1971).

In the elementary grades, the problem of exces- 
sively difficult instructional materials is likely to be most 
severe in subject areas of the curriculum, since teachers have
greater latitude to match basal readers to students' reading
capabilities than they have to assign science, math, or 
social studies texts that are in accord with students' reading 
skills. Normally, subject area texts are determined on the 
basis of topics to be covered in a particular grade, and 
these topics are considered to be fixed, independent of
students' reading abilities.

To promote optimum learning, the recommendation 
might be made that teachers take steps to provide all stu- 
dents with materials of suitable difficulty. Development of 
the readability formula and determination of the optimal 
difficulty levels for learning should provide practical tools 
to assist teachers in doing so. While traditional readability 
formulas have been viewed as a means for helping teachers to 
judge the suitability of materials, the absence of clear 
guidelines concerning how to employ them (i.e., what readability
level should be chosen for instructional purposes for
a child currently reading at level X) has effectively
precluded their use as a means for routinely selecting
instructional materials.

Application of the guidelines proposed by 
Thorndike (i.e., that materials are suitable if 75% of 
questions are answered correctly) is so cumbersome that it
is highly unlikely that teachers have made use of these 
guidelines on any regular basis. Bormuth's (1968b) suggestion
that cloze tests be substituted for multiple choice questions 
in determining the suitability of materials hardly seems to 
make the teacher's job much easier. By contrast, an accurate
readability formula (for scaling all materials used in
schools) coupled with knowledge of optimum difficulty levels
for learning should make it possible to supply teachers with
the information they need to determine the suitability of
new materials for learners on a routine basis. 

However, there is a possibility that the prob- 
lem of inappropriately difficult instructional materials 
may not be readily solved. Suppose that the readability of 
science materials used in the fifth grade were too hard for 
some fifth grade students. An obvious recommendation would
be that more readable materials be provided. This might 
require that publishers be directed to produce materials 
covering essentially the same content, but at easier levels 
of readability. However, as noted earlier in this chapter, 
the vocabulary required to convey essential concepts limits 
the extent to which the readability of materials covering a 
particular subject matter can be simplified. Thus it may 
not be realistic to expect that all the needed instructional 
materials could be produced even if the effort were made. 

If materials of appropriate readability (and 
suitable content) could not be provided to students, it would 
pose a clear problem for the efficient management of instruction. 
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Traditional solutions of educators to the problem presented 
by students who cannot read the materials designated for 
their grades have included remedial instruction (to try to 
bring students to the point where they could profitably use 
grade level materials), non-promotion, ability grouping, and
reduced class size. However, since the effects of these and 
other managerial policies on learning are uncertain, a recom- 
mended course of action for the State Education Department's 
managers is not evident at this time. The problem is a com- 
plex one that will require study. 
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V. 



D. A Suitable Measure and a Means for Change 

The approach to the measurement of effectiveness in 
reading that has been described in this report differs in 
two significant respects from other attempts to measure
system performance in reading. First, the proposed approach
should result in tests meeting all the functional
specifications for an effectiveness measure outlined in Chapter I.
The report shows how the design concepts for characterizing
language in terms of readability and word familiarity would
contribute to setting precise quantitative standards of adult
reading competence, and how they would make it possible to
build reliable, valid, and clearly interpretable measures to
ascertain students' progress toward and attainment of those
standards.

Second, the approach described in this report should 
make it possible to improve system effectiveness through a 
chain of interrelated steps involving many of the key actors 
in the educational process: teachers, educational managers,
and publishers. For example, it has been shown that, if
system effectiveness were below par, the readability and word
familiarity measures could be used in analyses to identify
possible problems in the learning demands placed on students.
These analyses might lead to recommendations for changes in
the vocabulary and readability content of instructional
materials in different grades. Since the recommended changes
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would rationalize learning demands, they should increase the 
likelihood that students will ultimately meet adult standards 
of competence in reading. The measures could be used to 
verify that the designated changes were implemented and to 
determine whether the anticipated changes in achievement 
occurred. If they did not, the process could be recycled. 

The prospect of using measurement as one step in a chain 
to increase system effectiveness clearly sets the proposed 
effectiveness measure apart from other attempts to assess 
system performance in reading. Other assessment procedures, 
such as norm-referenced tests, seldom if ever result in the 
identification of problems or in recommendations for change, 
because they lack mechanisms for examining test results in 
relation to instructional materials. By contrast, the
strategy for measuring effectiveness in reading that has 
been proposed in this report could quite possibly lead to an 
increase in system effectiveness when fully implemented. 
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APPENDIX 

AN INTRODUCTION TO TYPE-TOKEN MATHEMATICS. 



THE LOGNORMAL DISTRIBUTION 



In this appendix we consider the problem of describing a 
large vocabulary, such as that of the English language, in 
mathematical terms. We also address the problem of estimating 
the true statistical parameters of such a vocabulary from the 
properties of actual samples. 

Notations and Terminology

Let us regard the vocabulary, or lexicon, as a set $\mathcal{S}$
containing a large number $\Phi$ of types (different words). This
set $\mathcal{S}$ is then the total theoretical reservoir from which
users of the language must draw in speaking or writing.

Each type $T$ in the set $\mathcal{S}$ is assumed to have a certain
theoretical probability of occurrence, denoted by $\pi(T)$, or simply
by $\pi$. Thus, in a very large sample, say of $N$ tokens, we
should expect very nearly $N\pi$ of these tokens to be instances
of the particular type $T$.

Now let us subdivide the entire lexicon $\mathcal{S}$ into disjoint
classes, on the basis of probability:

$$\mathcal{S} = \mathcal{B}_1 \cup \mathcal{B}_2 \cup \cdots \cup \mathcal{B}_M \tag{1}$$

Thus, $\mathcal{B}_1$ consists of all types having probability $\pi_1$,
while $\mathcal{B}_2$ consists of all types having probability $\pi_2$,
and so on. We may as well suppose that the numbering has been done
in order of increasing $\pi$, so that $\pi_1$ is the smallest, and
$\pi_M$ the largest.

For each integer $i$ ($i = 1, 2, \ldots, M$), let $\phi_i$ denote the
number of types in the class $\mathcal{B}_i$. Then the fraction
$\Delta A_i = \phi_i / \Phi$ is the proportion of types in that class.
We may then sum these $\Delta A_i$ to obtain the cumulative proportion
of types having probabilities $\le \pi_i$:

$$A_i = \sum_{K=1}^{i} \Delta A_K \tag{2}$$

Obviously we have $A_0 = 0$ and $A_M = 1$, so that

$$\Delta A_i = A_i - A_{i-1} \qquad (i = 1, 2, \ldots, M).$$

The above expression $A_i$ is actually a distribution
function, in the sense of general probability theory. To be
specific, consider the experiment of choosing a type $T$ at
random from the lexicon and observing its probability $\pi(T)$.
The value observed is then a random variable, and $A_i$ is its
distribution function; that is, $A_i$ is the probability that
the observed probability will not exceed $\pi_i$. (The double
occurrence of the word "probability" in the last sentence is a
source of some confusion until one gets accustomed to it.) We
shall frequently refer to $A_i$ as the type distribution.

There is also a token distribution, corresponding to a 
different random variable, obtained from a different experi- 
ment. This time we select a token at random from the entire 
written language, and observe the probability of the type
represented by the selected token. If we let $\Delta A_i^*$ denote the
proportion of tokens accounted for by the types in class
$\mathcal{B}_i$, then the token distribution is:

$$A_i^* = \sum_{K=1}^{i} \Delta A_K^* \tag{3}$$

In this appendix, starred symbols will always refer to the
token distribution, while the corresponding unstarred symbols
will refer to the type distribution.

The two distributions describe quite different phenomena, 
as can be seen by considering approximations to the two
experiments sketched above. For the first case (the type
distribution), we might open the Oxford English Dictionary at random,
and select one of the entry-words. For the second case (the
token distribution), we might choose a book at random from the
Library of Congress, and select any word from it. It should
be obvious that the results will differ: the first experiment
will most often produce a word from the middle of the proba- 
bility range, while the second will most often produce a word 
from the high end. 

Some Basic Relationships 

While the type and token distributions describe different
phenomena, as just explained, they are mathematically related
to each other. In fact, either of them can be derived from
the other. To see this, consider the class $\mathcal{B}_i$. There are
$\phi_i$ types in this class, and each of these types has
probability $\pi_i$, which means that each of these $\phi_i$ types
should account for the proportion $\pi_i$ of the tokens in a large
sample. Hence the proportion of tokens accounted for by all
the types in $\mathcal{B}_i$ is:

$$\Delta A_i^* = \phi_i \pi_i = \Phi\, \pi_i\, \Delta A_i \tag{4}$$

and so we have:

$$A_i^* = \sum_{K=1}^{i} \Delta A_K^* = \sum_{K=1}^{i} \phi_K \pi_K = \Phi \sum_{K=1}^{i} \pi_K\, \Delta A_K \tag{5}$$

Thus, a knowledge of the type distribution $A_i$ enables one to
calculate the token distribution $A_i^*$.

Another relationship of considerable importance is a
direct consequence of (5), obtained by setting $i = M$. We observe
that $A_M^* = 1$, and hence we must have:

$$\frac{1}{\Phi} = \sum_{K=1}^{M} \pi_K\, \Delta A_K \tag{6}$$

This means that the total theoretical vocabulary size can be
calculated from knowledge of the type distribution.

We may now combine Eqs. (5) and (6), to obtain an alternate
version of (5) which reveals the token distribution to be
precisely the so-called "first-order moment-distribution" of
the type distribution:

$$A_i^* = \frac{\sum_{K=1}^{i} \pi_K\, \Delta A_K}{\sum_{K=1}^{M} \pi_K\, \Delta A_K} \tag{7}$$
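
The relationships in Eqs. (2) through (7) can be illustrated with a small sketch. The three-class lexicon below is entirely hypothetical (its class probabilities and type counts were chosen only so that the token proportions sum to 1), and the function names are illustrative rather than taken from the report.

```python
from itertools import accumulate


def type_and_token_distributions(probs, counts):
    """Return (Phi, A, A_star) for classes listed in order of increasing probability.

    probs[i]  -- probability pi_i shared by every type in class i
    counts[i] -- number of types phi_i in class i
    """
    phi_total = sum(counts)                             # total number of types, Phi
    delta_a = [c / phi_total for c in counts]           # Delta A_i = phi_i / Phi
    a = list(accumulate(delta_a))                       # Eq. (2): type distribution
    delta_a_star = [c * p for c, p in zip(counts, probs)]  # Eq. (4): phi_i * pi_i
    a_star = list(accumulate(delta_a_star))             # Eq. (3): token distribution
    return phi_total, a, a_star


def vocabulary_size_from_type_distribution(probs, delta_a):
    """Eq. (6): Phi is the reciprocal of sum(pi_K * Delta A_K)."""
    return 1.0 / sum(p * d for p, d in zip(probs, delta_a))


# Hypothetical toy lexicon: 500 rare types, 30 medium types, 2 very common types.
PROBS = [0.001, 0.01, 0.1]
COUNTS = [500, 30, 2]
PHI, A, A_STAR = type_and_token_distributions(PROBS, COUNTS)
```

In this toy case the rare class holds about 94% of the types but only half of the tokens, which is the contrast between the two experiments described earlier.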



The Incidence Numbers 

In a sample of $N$ tokens, we may expect a certain number
$F_1$ of types to occur exactly once each, a certain number $F_2$
of types to occur exactly twice each, a certain number $F_3$ of
types to occur exactly three times each, and so on. We may
also expect a certain number $F_0$ of types not to occur at all
in the sample. We shall refer to these numbers $F_0$, $F_1$,
$F_2$, ... as the incidence numbers. We now consider the problem
of calculating the incidence numbers, assuming we have knowledge
of the true theoretical type distribution $A_i$.

Of course, if the sample size N were very large indeed 
(say, in the trillions or quadrillions), then the problem 
would be quite simple. Each type would then occur almost ex- 
actly as often as its true probability dictates, and so the 
incidence numbers would coincide with the numbers $\phi_i$. In
actual practice, however, the sample size N will not be nearly 
large enough for this simple approach; we must therefore
proceed differently.

Consider a particular type belonging to the class $\mathcal{B}_i$.
Since this type has true probability $\pi_i$, the probability
that this type will occur exactly $j$ times in a sample of
size $N$ is given by the well-known "binomial" formula:

$$\binom{N}{j} \pi_i^{j} (1-\pi_i)^{N-j}$$

Now, there are $\phi_i$ types in class $\mathcal{B}_i$, and so the number of
types from class $\mathcal{B}_i$ which may be expected to occur exactly
$j$ times in a sample of size $N$ is the product of the preceding
expression by the number $\phi_i$:

$$\phi_i \binom{N}{j} \pi_i^{j} (1-\pi_i)^{N-j} = \Phi \binom{N}{j} \pi_i^{j} (1-\pi_i)^{N-j}\, \Delta A_i \tag{8}$$

Finally, the total number of types from all classes which may
be expected to occur exactly $j$ times in a sample of size $N$
is obtained by summing (8) over all values of the index $i$.
Thus we may calculate the incidence numbers:

$$F_j = \sum_{i=1}^{M} \phi_i \binom{N}{j} \pi_i^{j} (1-\pi_i)^{N-j} = \Phi \binom{N}{j} \sum_{i=1}^{M} \pi_i^{j} (1-\pi_i)^{N-j}\, \Delta A_i \tag{9}$$
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Checking for Internal Consistency 



Formula (9) may be checked in two ways. In the first
place, the sum of all the incidence numbers should agree
with the total number of types:

$$\sum_{j=0}^{N} F_j = \sum_{i=1}^{M} \phi_i \sum_{j=0}^{N} \binom{N}{j} \pi_i^{j} (1-\pi_i)^{N-j} = \sum_{i=1}^{M} \phi_i = \Phi$$

In the second place, the sum of all the products $jF_j$ should
agree with the total number of tokens:

$$\sum_{j=0}^{N} jF_j = \sum_{i=1}^{M} \phi_i \sum_{j=0}^{N} j \binom{N}{j} \pi_i^{j} (1-\pi_i)^{N-j} = \sum_{i=1}^{M} \phi_i N \pi_i = N \sum_{i=1}^{M} \Delta A_i^* = N A_M^* = N$$

In both of the above calculations, we made use of certain
well-known tabulated results on sums involving the binomial
coefficients.*
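
Formula (9) and both consistency checks can be tried numerically on a small case. The sketch below assumes a hypothetical three-class lexicon and a sample of only 100 tokens, so that the binomial terms can be evaluated directly; the names are illustrative only.

```python
import math


def expected_incidence(j, n_tokens, probs, counts):
    """Formula (9): expected number of types occurring exactly j times."""
    binom = math.comb(n_tokens, j)
    return sum(
        c * binom * p**j * (1.0 - p) ** (n_tokens - j)
        for p, c in zip(probs, counts)
    )


# Hypothetical toy lexicon (class probabilities and type counts), N = 100 tokens.
PROBS = [0.001, 0.01, 0.1]
COUNTS = [500, 30, 2]
N_TOKENS = 100

F = [expected_incidence(j, N_TOKENS, PROBS, COUNTS) for j in range(N_TOKENS + 1)]
TOTAL_TYPES = sum(F)                                   # first check: should equal Phi
TOTAL_TOKENS = sum(j * f for j, f in enumerate(F))     # second check: should equal N
```

With these inputs the two totals come out at the number of types (532) and the number of tokens (100), up to rounding error.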

The Lognormal Distribution

There is considerable evidence to show that the type
distribution obtained from any natural language is very closely
approximated by a continuous distribution known as the lognormal
distribution. Accordingly, we now interrupt our discussion
of vocabulary statistics to examine some of the general
features of the lognormal distribution.

*See, for example, Handbook of Mathematical Functions, Abramowitz,
M. and Stegun, I. A. (eds.), Applied Mathematics Series 55,
National Bureau of Standards, 1964, pp. 822-823.

We begin by considering the normal distribution, which is
unquestionably the best-known of all the continuous distributions.
A random variable $X$ is said to be normally distributed if, for
every real number $x$, the probability that $X < x$ is given by the
formula:

$$\Pr(X < x) = \frac{1}{\sigma\sqrt{2\pi}} \int_{-\infty}^{x} \exp\left[-\frac{(t-\mu)^2}{2\sigma^2}\right] dt \tag{10}$$

The parameters $\mu$ and $\sigma$ which enter into this formula are
precisely the mean and the standard deviation of the distribution,
respectively. Indeed, because of the symmetry and unimodality
of the integrand, the parameter $\mu$ is simultaneously the mean,
median, and mode of the distribution. The following graph of
the integrand exhibits the familiar bell-shape of the "normal"
curve:

[Figure: graph of the integrand of (10), with the mean, median, and mode coinciding at $\mu$.]
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It is a trivial matter, of course, to shift the origin and 
change the scale, so that $\mu$ becomes 0 and $\sigma$ becomes 1.
When this is done, we say that the random variable $X$ is
standardized. The following notations are the usual ones
found in the literature:

Integrand: $Z(x) = \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}$

Integral: $P(x) = \int_{-\infty}^{x} Z(t)\, dt$

Complementary
Integral: $Q(x) = \int_{x}^{\infty} Z(t)\, dt = 1 - P(x)$

We shall have some occasion to refer to the functions $Z$, $P$,
and $Q$ in what follows.

Now let $Y$ be another random variable, this time limited
to positive values only. We say that $Y$ is lognormally
distributed if its logarithm $X = \ln Y$ is normally distributed.
For a given positive real number $y$, then, the probability that
$Y < y$ is the same as the probability that $X < \ln y$, and is
therefore given by the formula (10) with the substitution of
$\ln y$ for $x$ as the upper limit of integration:

$$\Pr(Y < y) = \frac{1}{\sigma\sqrt{2\pi}} \int_{-\infty}^{\ln y} \exp\left[-\frac{(t-\mu)^2}{2\sigma^2}\right] dt$$

If we change the variable of integration by the equation
$t = \ln s$, we obtain the following formula for the lognormal
distribution function:

$$\Pr(Y < y) = \frac{1}{\sigma\sqrt{2\pi}} \int_{0}^{y} \frac{1}{s} \exp\left[-\frac{(\ln s-\mu)^2}{2\sigma^2}\right] ds \tag{11}$$

Here we see that the lognormal distribution also involves two
parameters $\mu$ and $\sigma$, just as the parent normal distribution
does. It is important to note, however, that $\mu$ and $\sigma$ no
longer have the same interpretation as mean and standard
deviation; a bit of calculation shows that for the lognormal
distribution (11) these quantities are given by:

$$\text{mean} = e^{\mu + \sigma^2/2}$$

$$\text{standard deviation} = e^{\mu + \sigma^2/2} \sqrt{e^{\sigma^2} - 1}$$

and that the median and the mode no longer coincide with the
mean. The following graph of the integrand of (11) is drawn
for $\mu = 2$, $\sigma = 1$, for the sake of illustration; note the
different scales on the horizontal and vertical axes.
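
As a hedged numerical check of the mean and standard deviation for the illustrated case (parameters 2 and 1), the sketch below compares the closed forms with a simple trapezoid integration in the variable t = ln Y. The median and mode expressions used here (e^mu and e^(mu - sigma^2)) are the standard lognormal results, not formulas quoted in the text, and the function names are illustrative.

```python
import math


def lognormal_summary(mu, sigma):
    """Closed-form mean, standard deviation, median, and mode of a lognormal."""
    mean = math.exp(mu + sigma**2 / 2.0)
    std = mean * math.sqrt(math.exp(sigma**2) - 1.0)
    return {
        "mean": mean,
        "std": std,
        "median": math.exp(mu),            # standard result, assumed here
        "mode": math.exp(mu - sigma**2),   # standard result, assumed here
    }


def lognormal_moment_numeric(order, mu, sigma, n_grid=20000):
    """E[Y**order] by the trapezoid rule in t = ln Y."""
    center = mu + order * sigma**2         # where exp(order*t) * density peaks
    t_lo, t_hi = center - 12.0 * sigma, center + 12.0 * sigma
    step = (t_hi - t_lo) / n_grid
    total = 0.0
    for k in range(n_grid + 1):
        t = t_lo + k * step
        weight = 0.5 if k in (0, n_grid) else 1.0
        density = math.exp(-((t - mu) ** 2) / (2.0 * sigma**2)) / (
            sigma * math.sqrt(2.0 * math.pi)
        )
        total += weight * math.exp(order * t) * density
    return total * step


SUMMARY = lognormal_summary(2.0, 1.0)
M1 = lognormal_moment_numeric(1, 2.0, 1.0)
M2 = lognormal_moment_numeric(2, 2.0, 1.0)
NUMERIC_STD = math.sqrt(M2 - M1**2)
```

For these parameters the mode, median, and mean are ordered from smallest to largest, which is the skew the graph is meant to show.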
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Because a lognormally distributed random variable assumes 
positive values only, it is possible to define moments of arbi- 
trary order (including fractional order) for such a variable. 
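To be explicit, the $\alpha$-th order moment of the distribution (11)
is obtained by inserting the factor $s^{\alpha}$ in the integrand and
then integrating over the full range:

$$M_\alpha = \frac{1}{\sigma\sqrt{2\pi}} \int_{0}^{\infty} s^{\alpha-1} \exp\left[-\frac{(\ln s-\mu)^2}{2\sigma^2}\right] ds \tag{12}$$

A bit of simple calculus leads to an exact evaluation of this
expression: $M_\alpha = \exp\left(\alpha\mu + \tfrac{1}{2}\alpha^2\sigma^2\right)$.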
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Again because we are dealing with positive values only, 
it makes sense to define moment-distributions of arbitrary
order. These are defined just like the moments above, except
that the upper limit of integration becomes variable again;
the factor $1/M_\alpha$ is also introduced to normalize the resulting
integrals in the conventional manner.

Thus the $\alpha$-th order moment-distribution corresponding to
(11) would be as follows:

$$D_\alpha(y) = \frac{1}{M_\alpha\, \sigma\sqrt{2\pi}} \int_{0}^{y} s^{\alpha-1} \exp\left[-\frac{(\ln s-\mu)^2}{2\sigma^2}\right] ds \tag{13}$$

In case $\alpha = 0$, the above expression coincides with (11),
because $M_0 = 1$. Thus the zeroth order moment-distribution
is identically the original distribution.

Now we come to the remarkable self-replicating property
which is peculiar to lognormal distributions. Upon performing
a few elementary manipulations, we find that (13) can be
rewritten in the following revealing form:

$$D_\alpha(y) = \frac{1}{\sigma\sqrt{2\pi}} \int_{0}^{y} \frac{1}{s} \exp\left[-\frac{(\ln s-\mu-\alpha\sigma^2)^2}{2\sigma^2}\right] ds \tag{14}$$

Comparison of (14) with (11) shows almost exact agreement; the
only difference is that "$\mu$" in (11) is replaced by
"$\mu + \alpha\sigma^2$" in (14). This means that every moment-distribution
of a lognormal distribution is itself lognormal. Moreover,
there is a simple relation between the parameters: the "$\sigma$"
parameter is unchanged, and the "$\mu$" parameter becomes
$\mu + \alpha\sigma^2$.

It will be convenient, in what follows, to have a special
notation for the expressions which constantly recur in
discussions of lognormal distributions. Accordingly, we
introduce the symbols $\ell$ and $\mathcal{L}$ for the integrand and the
integral of (11), as follows:

Integrand: $\ell(y,\mu,\sigma) = \frac{1}{y\,\sigma\sqrt{2\pi}} \exp\left[-\frac{(\ln y-\mu)^2}{2\sigma^2}\right]$

Integral: $\mathcal{L}(y,\mu,\sigma) = \int_{0}^{y} \ell(s,\mu,\sigma)\, ds$
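
The "elementary manipulations" that carry (13) into (14) amount to completing the square in the exponent. A sketch of that step, writing $\alpha$ for the order of the moment and substituting $t = \ln s$ (so that the inserted factor $s^{\alpha}$ becomes $e^{\alpha t}$):

```latex
\begin{align*}
\alpha t - \frac{(t-\mu)^2}{2\sigma^2}
  &= -\frac{t^2 - 2(\mu + \alpha\sigma^2)\,t + \mu^2}{2\sigma^2} \\
  &= -\frac{(t - \mu - \alpha\sigma^2)^2}{2\sigma^2}
     + \frac{(\mu + \alpha\sigma^2)^2 - \mu^2}{2\sigma^2} \\
  &= -\frac{(t - \mu - \alpha\sigma^2)^2}{2\sigma^2}
     + \alpha\mu + \tfrac{1}{2}\alpha^2\sigma^2 .
\end{align*}
```

The constant term exponentiates to $\exp(\alpha\mu + \tfrac{1}{2}\alpha^2\sigma^2)$, which is exactly the moment evaluated after (12); it cancels against the normalizing factor in (13), leaving a lognormal integrand whose location parameter has been shifted by $\alpha\sigma^2$.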



Vocabulary Statistics 

Now let us resume our discussion of vocabulary statistics. 
Our earlier results, embodied in Eqs. (1) through (9), were all
exact; but now we shall introduce some approximations which
will make the calculations more tractable. The basic assumption
is that the type distribution is, in fact, very nearly
lognormal:

$$A_i \approx \mathcal{L}(\pi_i, \mu, \sigma) \tag{15}$$

Since we already know that the token distribution $A_i^*$ is the
first-order moment-distribution of $A_i$, it follows from the
self-replication property of $\mathcal{L}$ that $A_i^*$ is also very nearly
lognormal:

$$A_i^* \approx \mathcal{L}(\pi_i, \mu + \sigma^2, \sigma) \tag{16}$$

Our problem, of course, is to determine the correct values
of the parameters $\mu$ and $\sigma$ from a sampling of words. If
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our basic assumption is correct, these two numbers will furnish 
a very concise and complete description of the entire lexicon 
from which the sample is drawn. 

A few words are in order at this point concerning the
nature of the approximation we are using. The true type
distribution, after all, is discrete, because the lexicon
contains only a finite number of types; yet we are assuming
that it is well approximated by the lognormal distribution,
which is continuous. It is by no means uncommon to pass from
the discrete to the continuous in statistical work, but in this
case we may have some problems, especially at the upper end of
the frequency (probability) scale.

One difficulty is that the types of highest frequency
have widely and irregularly scattered frequencies. This is
clearly shown by the seven most common types, whose frequencies
(taken from the AHI corpus)* are as follows:

    Type    Frequency
    the     0.0731
    of      0.0285
    and     0.0262
    a       0.0244
    to      0.0236
    in      0.0194
    is      0.0116

*John B. Carroll et al., Word Frequency Book, American Heritage
Publishing Co., New York, 1971.

Such departures from continuity indicate that our lognormal
model becomes very unsatisfactory for the words which occur
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most frequently. 

Another difficulty is that the lognormal curve is un- 
bounded on the right. Since, for our applications, the ab- 
scissae represent word-type probabilities, this means that 
the lognormal model provides for types of arbitrarily high
probability (i.e., even greater than 1). Here again the model
fails at the upper end. 

We have considered resolving these difficulties by chang- 
ing our model from lognormal to "truncated" lognormal. In 
other words, we might assume that the type distribution is 
lognormal up to a certain cut-off point; and beyond that point, 
it is discrete. There is no doubt that this assumption is 
more nearly true than that of 100% lognormality. Some
preliminary calculations, however, have indicated that there is
very little difference in the final results (the values of
$\mu$ and $\sigma$) obtained from the two assumptions. Accordingly,
for the remainder of this discussion, we adopt the assumption 
that type and token distributions are completely lognormal. 

The passage from discrete to continuous representations
entails a number of notational changes, and it may be well to
take note of how these changes affect the appearance of our
earlier results. The discrete variable $\pi_i$ gives way to the
continuous variable $\pi$, and similarly $A_i$ and $A_i^*$ become
$A(\pi)$ and $A^*(\pi)$. The finite sums are replaced by integrals,
and the discrete differences $\Delta A_i$ become differentials
$dA(\pi)$. Note that since we are assuming lognormality we have

$$A(\pi) = \mathcal{L}(\pi,\mu,\sigma)$$

$$dA(\pi) = \ell(\pi,\mu,\sigma)\, d\pi$$

With such changes in mind, we may rewrite Eq. (6), which gives
a formula for the total vocabulary $\Phi$, as follows:

$$\frac{1}{\Phi} = \int_{0}^{1} \pi\, \ell(\pi,\mu,\sigma)\, d\pi \tag{17}$$
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Note that we have taken the upper limit of integration to be 
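1; there is no point in making it any higher, since the variable
$\pi$ represents a probability. In the same way our formula (9)
for the incidence numbers may be rewritten as:

$$F_j = \Phi \binom{N}{j} \int_{0}^{1} \pi^{j} (1-\pi)^{N-j}\, \ell(\pi,\mu,\sigma)\, d\pi \tag{18}$$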
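
Because the first-order moment-distribution of a lognormal is again lognormal, the integral for the total vocabulary in Eq. (17) has a closed form in terms of the standard normal distribution function. The sketch below states that closed form and checks it against a trapezoid integration; the parameter values are hypothetical (not estimates from any corpus), and `math.erfc` is used as a convenient stand-in for the normal distribution function.

```python
import math


def std_normal_cdf(x):
    """Standard normal distribution function P(x), via the complementary error function."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))


def vocabulary_size(mu, sigma):
    """Closed form of Eq. (17): 1/Phi = exp(mu + sigma^2/2) * P((0 - mu - sigma^2)/sigma)."""
    first_moment = math.exp(mu + sigma**2 / 2.0)
    mass_below_one = std_normal_cdf((0.0 - mu - sigma**2) / sigma)
    return 1.0 / (first_moment * mass_below_one)


def vocabulary_size_numeric(mu, sigma, n_grid=20000):
    """Same quantity by the trapezoid rule in t = ln(pi), over t <= 0."""
    t_lo = mu - 12.0 * sigma               # far enough into the lower tail
    step = (0.0 - t_lo) / n_grid
    total = 0.0
    for k in range(n_grid + 1):
        t = t_lo + k * step
        weight = 0.5 if k in (0, n_grid) else 1.0
        density = math.exp(-((t - mu) ** 2) / (2.0 * sigma**2)) / (
            sigma * math.sqrt(2.0 * math.pi)
        )
        total += weight * math.exp(t) * density   # pi * l(pi) d(pi) = e^t * density dt
    return 1.0 / (total * step)


MU, SIGMA = -8.0, 1.5                      # hypothetical parameter values
PHI_CLOSED = vocabulary_size(MU, SIGMA)
PHI_NUMERIC = vocabulary_size_numeric(MU, SIGMA)
```

For these illustrative parameters the theoretical vocabulary comes out at a little under a thousand types.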



Calculation of $\mu$ and $\sigma$

Now that we have a grasp of the underlying mathematical
theory, we can proceed to the problem of determining the values
of $\mu$ and $\sigma$ from the properties of an actual sample. Let the
sample consist of $N$ tokens altogether, and let us suppose that
all the preliminary processing (such as lemmatization and
resolution of ambiguities) has been completed.

We can then determine the sample incidence numbers $G_1$,
$G_2$, $G_3$, ... by a straightforward count. Thus, $G_1$ is the
number of types which are found to occur once each in the
sample, $G_2$ is the number of types which are found to occur
twice each in the sample, and so on. Note the distinction
between the sample incidence numbers $G_j$ and the theoretical
incidence numbers $F_j$ previously mentioned:

    $F_j$ is the number of types which may be expected
    to occur exactly $j$ times each in a sample of
    size $N$, assuming knowledge of the true type
    distribution $A_i$.

    $G_j$ is the number of types which actually did
    occur exactly $j$ times each in the sample.

Note also that we have no direct way of observing the value
of $G_0$, the number of types which failed to occur at all in
the sample.
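
The "straightforward count" of sample incidence numbers can be sketched in a few lines. The token list is a made-up example, and the function name is illustrative; any lemmatization is assumed to have been done beforehand.

```python
from collections import Counter


def sample_incidence_numbers(tokens):
    """Return {j: G_j}: the number of types that occur exactly j times in the sample."""
    occurrences_per_type = Counter(tokens)              # type -> number of occurrences
    return dict(Counter(occurrences_per_type.values())) # occurrences -> number of types


# Hypothetical ten-token sample.
TOKENS = "the cat sat on the mat and the dog sat".split()
G = sample_incidence_numbers(TOKENS)
```

Here five types occur once, one type occurs twice, and one occurs three times; as with the theoretical incidence numbers, the products of j and the j-th count sum to the sample size. The count for types that never occurred is necessarily absent from the result.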

Next we begin an iterative calculation; we simply guess
at the values of $\mu$ and $\sigma$ to start the process. Naturally
enough, the closer our initial guesses are to the true values,
the more rapidly the iteration will converge. We could, for
example, derive our initial values from the values already
obtained in earlier vocabulary studies (e.g. the AHI corpus),
after estimating the effect of lemmatization.*

From the assumed values of $\mu$ and $\sigma$ we can calculate
the total theoretical vocabulary $\Phi$ by Eq. (17). This gives
us a way to estimate the missing sample incidence number
$G_0$, since:

$$G_0 = \Phi - \sum_{j \ge 1} G_j \tag{19}$$

We can also calculate the theoretical incidence numbers $F_j$
($j = 0, 1, 2, 3, \ldots$) by Eq. (18). This calculation itself
presents some interesting problems. These problems are
explained in the final section of this appendix.

Next, we compare the theoretical incidence numbers $F_j$
with the actual sample incidence numbers $G_j$. If the two
families of numbers agree substantially, then we are
satisfied that the assumed values of $\mu$ and $\sigma$ were correct, and
we are finished. If not, then we must choose new values for
$\mu$ and $\sigma$, and repeat the process.

*Lemmatization is discussed in Chapter II.
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Two problems arise here. First, we must decide exactly 
what constitutes "substantial agreement" between the two 
families of numbers. Second, when this agreement is not 
achieved, we must know how to modify the values of $\mu$ and $\sigma$
so that the next "pass" will be more nearly successful. 

There are many ways to dispose of these two problems. We
might very well adopt the procedure used by John B. Carroll in
his statistical analysis of the AHI corpus. In broad outline,
this procedure was as follows:

1. Define two functions $M(X_0, X_1, X_2, \ldots)$ and
   $S(X_0, X_1, X_2, \ldots)$. The exact choice of definition
   made by Carroll is somewhat elaborate, and
   need not concern us here. Suffice it to say that
   $M$ and $S$ are the values of the sample parameters
   corresponding to $\mu$ and $\sigma$, for a given set of
   incidence numbers $X_0, X_1, X_2, \ldots$.

2. Evaluate both functions, first with the theoretical
   incidence numbers $F_0, F_1, F_2, \ldots$ as arguments, and
   second with the sample incidence numbers
   $G_0, G_1, G_2, \ldots$ as arguments.

3. If the evaluations agree, say to three decimal
   places, we are finished. If not, choose new
   parameters $\mu'$ and $\sigma'$ by the "correction" equations:

   $$\mu' = \frac{M(G_0, G_1, G_2, \ldots)}{M(F_0, F_1, F_2, \ldots)}\, \mu$$

   $$\sigma' = \frac{S(G_0, G_1, G_2, \ldots)}{S(F_0, F_1, F_2, \ldots)}\, \sigma$$

   (Note that the particular form of these correction
   equations is a result of the way the functions $M$ and
   $S$ were defined.)

4. Using the new values $\mu'$ and $\sigma'$ (instead of $\mu$
   and $\sigma$), recalculate the theoretical incidence
   numbers $F_0, F_1, F_2, \ldots$ (and also $G_0$), and return
   to step 2.

This type of procedure, however, is not the only one available. 
The literature of numerical analysis abounds with methods of 
achieving convergence in an iterative calculation. The choice 
of method should be left open until the actual data are 
available. 
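
The control flow of the four-step procedure can be sketched as a loop with multiplicative corrections. Everything numerical below is a stand-in: Carroll's actual definitions of the two statistics are not given in this appendix, so `toy_theoretical_ms` is a hypothetical pair of functions chosen only so that the loop has something to converge on, and the sample statistics are held fixed even though the real procedure would recompute the unobserved zero-occurrence count on every pass.

```python
def iterate_parameters(mu, sigma, theoretical_ms, sample_m, sample_s,
                       tol=1e-10, max_passes=100):
    """Repeat steps 2-4 until the theoretical statistics match the sample ones."""
    for passes in range(1, max_passes + 1):
        m_theory, s_theory = theoretical_ms(mu, sigma)          # step 2
        if abs(m_theory - sample_m) < tol and abs(s_theory - sample_s) < tol:
            return mu, sigma, passes                            # step 3: agreement
        mu *= sample_m / m_theory                               # step 3: correction
        sigma *= sample_s / s_theory
    raise RuntimeError("no convergence within max_passes")


def toy_theoretical_ms(mu, sigma):
    """Hypothetical stand-in for the two statistics computed from theoretical incidence numbers."""
    return mu + 0.05 * sigma, sigma + 0.01 * mu


# Pretend the sample was generated with these (hypothetical) true parameters.
TRUE_MU, TRUE_SIGMA = -10.0, 2.0
SAMPLE_M, SAMPLE_S = toy_theoretical_ms(TRUE_MU, TRUE_SIGMA)

FIT_MU, FIT_SIGMA, PASSES = iterate_parameters(
    -9.0, 2.5, toy_theoretical_ms, SAMPLE_M, SAMPLE_S
)
```

In this toy setting the corrections contract quickly toward the generating parameters; with real incidence data the convergence behaviour would depend on how the two statistics are defined, which is why the text leaves the choice of method open.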

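Calculation of Incidence Numbers

The actual calculation of the theoretical incidence
numbers $F_j$ by Eq. (18) presents some formidable problems
in numerical analysis:

$$F_j = \Phi \binom{N}{j} \int_{0}^{1} \pi^{j} (1-\pi)^{N-j}\, \ell(\pi,\mu,\sigma)\, d\pi = \frac{\Phi}{\sigma\sqrt{2\pi}} \binom{N}{j} \int_{0}^{1} \pi^{j-1} (1-\pi)^{N-j} \exp\left[-\frac{(\ln \pi-\mu)^2}{2\sigma^2}\right] d\pi \tag{20}$$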



The chief problem is the factor $(1-\pi)^{N-j}$, which varies from
1 to 0 in the interval of integration. Because of the immense
size of $N$, it does so with great speed; this means that almost
all of the integral's value is contributed by the values of $\pi$
at the extreme left end of the interval. In this region the
integrand varies so rapidly as to defy ordinary methods of
numerical quadrature.

One solution, which has been tested and which seems to
work very well, consists of approximating the integral (20)
by a sum of integrals over subintervals. These subintervals
are so chosen that the troublesome factor $(1-\pi)^{N-j}$ is
essentially constant in each of them. Specifically, we proceed
as follows: 

1. Choose a small positive number $\epsilon$, say $\epsilon = 1/M$
   where $M$ is an integer.

2. Let $H_0 = 1$, $H_1 = 1 - \epsilon$, $H_2 = 1 - 2\epsilon$, and so on. That
   is, $H_i = 1 - i\epsilon$. Then we have

   $$1 = H_0 > H_1 > H_2 > \cdots > H_M = 0$$

3. Define the numbers $\pi_i$ by setting $(1-\pi_i)^{N-j} = H_i$.
   That is, $\pi_i = 1 - (H_i)^{1/(N-j)}$. Since the $H_i$ form
   a decreasing sequence, the $\pi_i$ must form an
   increasing one:

   $$0 = \pi_0 < \pi_1 < \pi_2 < \cdots < \pi_M = 1$$

4. Now let $S_i$ denote the same integral as (20), but
   taken only over the $i$-th subinterval defined by the
   $\pi$'s:

   $$S_i = \frac{\Phi}{\sigma\sqrt{2\pi}} \binom{N}{j} \int_{\pi_{i-1}}^{\pi_i} \pi^{j-1} (1-\pi)^{N-j} \exp\left[-\frac{(\ln \pi-\mu)^2}{2\sigma^2}\right] d\pi$$

   The desired quantity $F_j$ is just the sum of these
   $S_i$'s:

   $$F_j = \sum_{i=1}^{M} S_i$$

5. Let $T_i$ be the same as $S_i$, but without the
   factor $(1-\pi)^{N-j}$ in the integrand:

   $$T_i = \frac{\Phi}{\sigma\sqrt{2\pi}} \binom{N}{j} \int_{\pi_{i-1}}^{\pi_i} \pi^{j-1} \exp\left[-\frac{(\ln \pi-\mu)^2}{2\sigma^2}\right] d\pi$$

   It happens to be a straightforward exercise to
   calculate $T_i$ explicitly in terms of the standard
   normal distribution function $P(x)$. The result is:

   $$T_i = \Phi \binom{N}{j} \exp\left[j\mu + \frac{j^2\sigma^2}{2}\right] (P_i - P_{i-1})$$

   where $P_i = P\left(\dfrac{\ln \pi_i - \mu - j\sigma^2}{\sigma}\right)$.

6. Obviously we have

   $$H_i T_i < S_i < H_{i-1} T_i$$

   and so, summing over $i$, we obtain:

   $$\sum_{i=1}^{M} H_i T_i < F_j < \sum_{i=1}^{M} H_{i-1} T_i$$

7. The two sums which flank the preceding inequality
   are readily calculated. To do this, let us take
   out the leading factor of $T_i$ (that is, $\Phi$ times
   the binomial coefficient times the exponential),
   which is independent of $i$, and consider the
   following sums:

   $$\sum_{i=1}^{M} H_i (P_i - P_{i-1}) \qquad \text{and} \qquad \sum_{i=1}^{M} H_{i-1} (P_i - P_{i-1})$$

   A bit of manipulation ("summation by parts"),
   together with knowledge of the regular spacing of
   the $H_i$, allows us to write these sums in a
   particularly simple form:

   $$\sum_{i=1}^{M} H_i (P_i - P_{i-1}) = \epsilon \sum_{i=1}^{M-1} P_i$$

   $$\sum_{i=1}^{M} H_{i-1} (P_i - P_{i-1}) = \epsilon \sum_{i=1}^{M} P_i$$

8. The quantities $P_i$ required here are easily found,
   because they are simply values of the standard
   normal distribution function $P(x)$. This function
   is well approximated by the first few terms of its
   Maclaurin series for small values of $x$, while for
   larger values we can use a rapidly convergent
   continued-fraction expansion.
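
Step 8 can be sketched as follows. The series and the continued fraction used here are the standard textbook expansions for the normal distribution function (term-by-term integration of the exponential series, and Laplace's continued fraction for the upper tail); the switch-over point of 2.5, the number of terms, and the depth are arbitrary choices made for this illustration rather than values from the report.

```python
import math

SQRT_2PI = math.sqrt(2.0 * math.pi)


def z_density(x):
    """Standard normal density Z(x)."""
    return math.exp(-0.5 * x * x) / SQRT_2PI


def p_series(x, n_terms=80):
    """Maclaurin series: P(x) = 1/2 + (1/sqrt(2 pi)) * sum (-1)^n x^(2n+1) / (n! 2^n (2n+1))."""
    power = x                      # holds (-1)^n x^(2n+1) / (n! 2^n)
    total = 0.0
    for n in range(n_terms):
        total += power / (2 * n + 1)
        power *= -x * x / (2.0 * (n + 1))
    return 0.5 + total / SQRT_2PI


def q_continued_fraction(x, depth=200):
    """Upper tail Q(x) = Z(x) / (x + 1/(x + 2/(x + 3/(x + ...)))), valid for x > 0."""
    tail = x
    for k in range(depth, 0, -1):  # evaluate the fraction from the innermost level outward
        tail = x + k / tail
    return z_density(x) / tail


def std_normal_cdf(x, switch=2.5):
    """P(x): series for small |x|, continued fraction for large |x|."""
    if abs(x) < switch:
        return p_series(x)
    if x > 0:
        return 1.0 - q_continued_fraction(x)
    return q_continued_fraction(-x)   # P(-x) = Q(x)
```

On a modern system a library error function would normally replace both expansions; the sketch is only meant to show that the two pieces named in step 8 fit together.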

Thus we have a non-iterative numerical procedure for
calculating the $F_j$. To summarize it, we have:

$$F_j = \Phi \binom{N}{j} \exp\left[j\mu + \frac{j^2\sigma^2}{2}\right] \epsilon \sum_{i=1}^{M} P_i \tag{21}$$

where the relative error is less than $P_M \big/ \sum_{i=1}^{M} P_i$ (which can be
made as small as desired by increasing the size of $M$).
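
The whole subinterval procedure can be sketched and checked against a brute-force integration on a deliberately small case. All numerical inputs below (2 occurrences, 5000 tokens, a vocabulary of 968 types, location -8 and spread 1.5, 20000 subintervals) are hypothetical, `math.erfc` stands in for the series and continued-fraction evaluation of the normal distribution function, and the leading factor is assembled in logarithms as one way of keeping its size under control.

```python
import math


def std_normal_cdf(x):
    """Standard normal distribution function (library stand-in)."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))


def incidence_bounds(j, n_tokens, phi_total, mu, sigma, m_steps):
    """Lower and upper bounds on F_j; the upper bound is the sum in Eq. (21)."""
    eps = 1.0 / m_steps
    shift = mu + j * sigma**2
    inner = 0.0                                   # sum of P_i for i = 1 .. M-1
    for i in range(1, m_steps):
        h_i = 1.0 - i * eps                       # H_i = 1 - i*eps
        # pi_i = 1 - H_i**(1/(N-j)), written with expm1 to keep precision near 0
        pi_i = -math.expm1(math.log(h_i) / (n_tokens - j))
        inner += std_normal_cdf((math.log(pi_i) - shift) / sigma)
    p_last = std_normal_cdf((0.0 - shift) / sigma)  # P_M, since pi_M = 1
    log_binom = (math.lgamma(n_tokens + 1) - math.lgamma(j + 1)
                 - math.lgamma(n_tokens - j + 1))
    lead = math.exp(math.log(phi_total) + log_binom
                    + j * mu + 0.5 * j * j * sigma**2)
    return lead * eps * inner, lead * eps * (inner + p_last)


def incidence_direct(j, n_tokens, phi_total, mu, sigma, n_grid=40000):
    """Trapezoid estimate of Eq. (20) in the variable t = ln(pi), used only as a check."""
    t_lo = mu - 12.0 * sigma
    step = -t_lo / n_grid
    total = 0.0
    for k in range(n_grid + 1):
        t = min(t_lo + k * step, 0.0)
        weight = 0.5 if k in (0, n_grid) else 1.0
        density = math.exp(-((t - mu) ** 2) / (2.0 * sigma**2)) / (
            sigma * math.sqrt(2.0 * math.pi)
        )
        total += (weight * math.exp(j * t)
                  * (1.0 - math.exp(t)) ** (n_tokens - j) * density)
    return phi_total * math.comb(n_tokens, j) * total * step


LOWER, UPPER = incidence_bounds(2, 5000, 968.0, -8.0, 1.5, 20000)
DIRECT = incidence_direct(2, 5000, 968.0, -8.0, 1.5)
```

The brute-force value should fall between the two bounds. The gap between them shrinks only in proportion to the reciprocal of the number of subintervals, so a large count may be needed when the final term of the sum is not small compared with the rest.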

There is need for considerable care, of course, in evalu- 
ating (21), especially for the larger values of j. This is 
because the leading factor (the binomial coefficient and the 
exponential) increases very rapidly with j to astronomical 
size, while the trailing factor (the sum of the $P_i$'s)
decreases with comparable speed. Treating the two factors
separately would quickly lead to intolerable "noise" in the
calculation. Once this is recognized, however, it is
straightforward to keep this situation under control.
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